FAST LEARNING RATES FOR PLUG-IN CLASSIFIERS

By Jean-Yves Audibert and Alexandre B. Tsybakov
It has been recently shown that, under the margin (or low noise)
assumption, there exist classifiers attaining fast rates of convergence
of the excess Bayes risk, that is, rates faster than $n^{-1/2}$. The work on
this subject has suggested the following two conjectures: (i) the best
achievable fast rate is of the order $n^{-1}$, and (ii) the plug-in classifiers
generally converge more slowly than the classifiers based on empirical
risk minimization. We show that both conjectures are not correct.
In particular, we construct plug-in classifiers that can achieve not
only fast, but also super-fast rates, that is, rates faster than $n^{-1}$.
We establish minimax lower bounds showing that the obtained rates
cannot be improved.






1. Introduction. Let $(X,Y)$ be a random couple taking values in
$\mathcal{Z} = \mathbb{R}^d \times \{0,1\}$ with joint distribution $P$. We regard
$X \in \mathbb{R}^d$ as a vector of features corresponding to an object and
$Y \in \{0,1\}$ as a label indicating that the object belongs to one of two
classes. Consider the sample $(X_1,Y_1), \ldots, (X_n,Y_n)$, where $(X_i,Y_i)$ are
independent copies of $(X,Y)$. We denote by $P^{\otimes n}$ the product
probability measure according to which the sample is distributed, and
by $P_X$ the marginal distribution of $X$.

The goal of a classification procedure is to predict the label $Y$ given the
value of $X$, that is, to provide a decision rule $f : \mathbb{R}^d \to \{0,1\}$
which belongs to the set $\mathcal{F}$ of all Borel functions defined on
$\mathbb{R}^d$ and taking values in $\{0,1\}$. The performance of a decision
rule $f$ is measured by the misclassification error

$$R(f) = P(Y \ne f(X)).$$

The Bayes decision rule is a minimizer of the risk $R(f)$ over all decision rules
$f \in \mathcal{F}$, and one of such minimizers has the form
$f^*(X) = 1_{\{\eta(X) \ge 1/2\}}$, where $1_{\{\cdot\}}$ denotes the indicator
function and $\eta(X) = P(Y = 1 \mid X)$ is the regression function of $Y$ on $X$
[here $P(dY \mid X)$ is a regular conditional probability, which we will use in
the following without further mention].

Received July 2005; revised April 2006.

AMS 2000 subject classifications. Primary 62G07; secondary 62G08, 62H05, 68T10.
Key words and phrases. Classification, statistical learning, fast rates of convergence,
excess risk, plug-in classifiers, minimax lower bounds.

This is an electronic reprint of the original article published by the
Institute of Mathematical Statistics in The Annals of Statistics,
2007, Vol. 35, No. 2, 608-633. This reprint differs from the original in pagination
and typographic detail.

An empirical decision rule (a classifier) is a random mapping
$\hat f_n : \mathcal{Z}^n \to \mathcal{F}$ measurable w.r.t. the sample. Its
accuracy can be characterized by the excess risk,

(1.1)  $\mathcal{E}(\hat f_n) = \mathbf{E} R(\hat f_n) - R(f^*) = \mathbf{E}\bigl(|2\eta(X) - 1| \, 1_{\{\hat f_n(X) \ne f^*(X)\}}\bigr)$,

where $\mathbf{E}$ denotes expectation. A key problem in classification is to
construct classifiers with small excess risk (cf. [8, 24]). Optimal classifiers can
be defined as those having the best possible rate of convergence of
$\mathcal{E}(\hat f_n)$ to 0, as $n \to \infty$. Of course, this rate, and thus the
optimal classifier, depend on the assumptions on the joint distribution of $(X,Y)$.
A standard way to define optimal classifiers is to introduce a class of joint
distributions of $(X,Y)$ and to declare $\hat f_n$ optimal if it achieves the best
rate of convergence in a minimax sense on this class.

Two types of assumptions on the joint distribution of (X, Y) are com- 
monly used: complexity assumptions and margin assumptions. 

Complexity assumptions are stated in two possible ways. The first of them
is to suppose that the regression function $\eta$ is smooth enough or, more
generally, belongs to a class of functions $\Sigma$ having a suitably bounded
$\varepsilon$-entropy. This is called a complexity assumption on the regression
function (CAR). Most commonly it is of the following form.

Assumption (CAR). The regression function $\eta$ belongs to the class $\Sigma$
of functions on $\mathbb{R}^d$ such that

$$\mathcal{H}(\varepsilon, \Sigma, L_p) \le A_* \varepsilon^{-\rho}, \qquad \forall \varepsilon > 0,$$

with some constants $\rho > 0$, $A_* > 0$. Here $\mathcal{H}(\varepsilon, \Sigma, L_p)$
denotes the $\varepsilon$-entropy of the set $\Sigma$ w.r.t. an $L_p$ norm with some
$1 \le p \le \infty$.

Recall that the metric entropy $\mathcal{H}(\varepsilon, \Sigma, L_p)$ is the
logarithm of the minimum number of $L_p$-balls of radius $\varepsilon$ covering the
set $\Sigma$ [10].

At this stage of discussion we do not identify precisely the value of $p$ for
the $L_p$ norm in Assumption (CAR), or the measure with respect to which
this norm is defined. Examples will be given later. If $\Sigma$ is a class of smooth
functions with smoothness parameter $\beta$ on a compact in $\mathbb{R}^d$, for
example, a Hölder class, as described below, a typical value of $\rho$ in
Assumption (CAR) is $\rho = d/\beta$.

Assumption (CAR) is well adapted for the study of plug-in rules, that is,
of the classifiers having the form

(1.2)  $\hat f_n^{PI}(X) = 1_{\{\hat\eta_n(X) \ge 1/2\}}$,

where $\hat\eta_n$ is a nonparametric estimator of the function $\eta$. Indeed,
Assumption (CAR) typically reads as a smoothness assumption on $\eta$, implying that
a good nonparametric estimator (kernel, local polynomial, orthogonal series
or other) $\hat\eta_n$ converges with some rate to the regression function $\eta$, as
$n \to \infty$. In turn, closeness of $\hat\eta_n$ to $\eta$ implies closeness of
$\hat f_n$ to $f^*$: for any plug-in classifier $\hat f_n^{PI}$ we have

(1.3)  $\mathbf{E} R(\hat f_n^{PI}) - R(f^*) \le 2\,\mathbf{E} \int |\hat\eta_n(x) - \eta(x)| \, P_X(dx)$

(cf. [8], Theorem 2.2). For various types of estimators $\hat\eta_n$ and under rather
general assumptions it can be shown that, if Assumption (CAR) holds, the
RHS of (1.3) is uniformly of the order $n^{-1/(2+\rho)}$, and thus

(1.4)  $\sup_{P : \eta \in \Sigma} \mathcal{E}(\hat f_n^{PI}) = O(n^{-1/(2+\rho)}), \qquad n \to \infty$

(cf. [26]). In particular, if $\rho = d/\beta$ (which corresponds to a class of smooth
functions with smoothness parameter $\beta$), we get

(1.5)  $\sup_{P : \eta \in \Sigma} \mathcal{E}(\hat f_n^{PI}) = O(n^{-\beta/(2\beta+d)}), \qquad n \to \infty.$

Note that (1.5) can be easily deduced from (1.3) and standard results on
the $L_1$ or $L_2$ convergence rates of usual nonparametric regression estimators
on $\beta$-smoothness classes $\Sigma$. The rates in (1.4), (1.5) are quite slow,
always slower than $n^{-1/2}$. In (1.5) they deteriorate dramatically as the
dimension $d$ increases. Moreover, Yang [26] showed that, under general assumptions,
the bound (1.5) cannot be improved in a minimax sense. These results raised
some pessimism about the plug-in rules.
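
To make the dependence on the dimension concrete, the following minimal Python sketch evaluates the exponent $\beta/(2\beta+d)$ of the rate in (1.5) for a few values of $d$. The function name `plug_in_slow_exponent` is ours and is introduced only for illustration.

```python
def plug_in_slow_exponent(beta: float, d: int) -> float:
    """Exponent r of the rate n^{-r} in (1.5), namely r = beta / (2*beta + d)."""
    if beta <= 0 or d < 1:
        raise ValueError("need beta > 0 and d >= 1")
    return beta / (2.0 * beta + d)


if __name__ == "__main__":
    # The exponent stays below 1/2 and shrinks as the dimension grows.
    for d in (1, 2, 10, 100):
        print(d, round(plug_in_slow_exponent(2.0, d), 4))
```

For any fixed $\beta$ the exponent is strictly below 1/2 and tends to 0 as $d$ grows, which is the deterioration described above.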

The second way to describe complexity is to introduce a structure on the
class of possible decision sets $G^* = \{x : f^*(x) = 1\} = \{x : \eta(x) \ge 1/2\}$
rather than on that of regression functions $\eta$. A standard complexity assumption
on the decision set (CAD) is the following.

Assumption (CAD). The decision set $G^*$ belongs to a class $\mathcal{G}$ of
subsets of $\mathbb{R}^d$ such that

$$\mathcal{H}(\varepsilon, \mathcal{G}, d_\Delta) \le A_* \varepsilon^{-\rho}, \qquad \forall \varepsilon > 0,$$

with some constants $\rho > 0$, $A_* > 0$. Here $\mathcal{H}(\varepsilon, \mathcal{G}, d_\Delta)$
denotes the $\varepsilon$-entropy of the class $\mathcal{G}$ w.r.t. the measure of
symmetric difference pseudo-distance between sets defined by
$d_\Delta(G, G') = P_X(G \Delta G')$ for two measurable subsets $G$ and $G'$ in
$\mathbb{R}^d$.

The parameter $\rho$ in Assumption (CAD) typically characterizes the smoothness
of the boundary of $G^*$ (cf. [20]). Note that, in general, there is no
connection between Assumptions (CAR) and (CAD). Indeed, the fact that $G^*$
has a smooth boundary does not imply that $\eta$ is smooth, and vice versa.
In Assumption (CAD), the values of $\rho$ closer to 0 correspond to smoother
boundaries (less complex sets $G^*$). As a limit case when $\rho \to 0$ one can
consider the Vapnik-Chervonenkis classes (VC-classes) for which the
$\varepsilon$-entropy is logarithmic in $1/\varepsilon$.

Assumption (CAD) is suited for the study of empirical risk minimization
(ERM) type classifiers introduced by Vapnik and Chervonenkis [25]; see
also [8, 24]. As shown in [20], for every $0 < \rho < 1$ there exist ERM classifiers
$\hat f_n^{ERM}$ such that, under Assumption (CAD),

(1.6)  $\sup_{P : G^* \in \mathcal{G}} \mathcal{E}(\hat f_n^{ERM}) = O(n^{-1/2}), \qquad n \to \infty.$

The rate of convergence in (1.6) is better than that for plug-in rules [see (1.4)
and (1.5)] and it does not depend on $\rho$ (resp., on the dimension $d$). Note that
the comparison between (1.6) and (1.4) and (1.5) is not quite legitimate,
because classes of joint distributions $P$ of $(X,Y)$ satisfying Assumption (CAR)
are different from those satisfying Assumption (CAD). Nevertheless, such a
comparison has been often interpreted as an argument in disfavor of the
plug-in rules. Indeed, Yang's lower bound shows that the $n^{-1/2}$ rate cannot
be attained under Assumption (CAR) suited for the plug-in rules. Recently,
advantages of the ERM type classifiers, including penalized ERM methods,
have been further confirmed by the fact that, under the margin (or low
noise) assumption, they can attain fast rates of convergence, that is, rates
that are faster than $n^{-1/2}$ [1, 11, 14, 15, 20, 22].

The margin assumption (or low noise assumption) is stated as follows.

Assumption (MA). There exist constants $C_0 > 0$ and $\alpha \ge 0$ such that

(1.7)  $P_X(0 < |\eta(X) - 1/2| \le t) \le C_0 t^{\alpha}, \qquad \forall t > 0.$

The case $\alpha = 0$ is trivial (no assumption) and is included for notational
convenience. The other extreme case $\alpha = \infty$ is most advantageous for
classification: the regression function $\eta$ is bounded away from 1/2. Assumption
(MA) provides a useful characterization of the behavior of the regression
function $\eta$ in the vicinity of the level $\eta = 1/2$, which turns out to be
crucial for convergence of classifiers. Note that the margin assumption does not
affect the complexity of the class of regression functions, but it affects the rate
of convergence of the excess risk due to its structure. This can be seen from
the following simple argument which underlies our results. For any $\delta > 0$,
from (1.1) and Assumption (MA) we get

(1.8)
$$
\begin{aligned}
\mathbf{E} R(\hat f_n^{PI}) - R(f^*)
&\le 2\delta \, P_X(0 < |\eta(X) - 1/2| \le \delta) \\
&\quad + \mathbf{E}\bigl(|2\eta(X) - 1| \, 1_{\{\hat f_n^{PI}(X) \ne f^*(X)\}} \, 1_{\{|\eta(X) - 1/2| > \delta\}}\bigr) \\
&\le 2 C_0 \delta^{1+\alpha} + 2\,\mathbf{E}\bigl(|\hat\eta_n(X) - \eta(X)| \, 1_{\{|\hat\eta_n(X) - \eta(X)| > \delta\}}\bigr),
\end{aligned}
$$

where in the last inequality we have used the fact that
$|\eta(X) - 1/2| \le |\hat\eta_n(X) - \eta(X)|$ on the set
$\{X : \hat f_n^{PI}(X) \ne f^*(X)\}$. Thus, the excess risk is
decomposed into two terms: $2 C_0 \delta^{1+\alpha}$, which is determined by the
margin assumption and reflects the behavior of $\eta$ near the decision boundary, and
the second term, which characterizes the regression estimation error. Optimal
convergence is essentially obtained by choosing the $\delta$ which balances the
two terms. Fast rates are possible because the second term in (1.8) decreases
exponentially in $\delta$ for several types of regression estimators $\hat\eta_n$.

For more discussion of the margin assumption see [20] and the survey [7].
The main point is that, under Assumption (MA), fast classification rates
up to $n^{-1}$ are achievable. In particular, for every $0 < \rho < 1$ and
$\alpha > 0$ there exist ERM type classifiers $\hat f_n^{ERM}$ such that

(1.9)  $\sup_{P : \text{(CAD)}, \text{(MA)}} \mathcal{E}(\hat f_n^{ERM}) = O\bigl(n^{-(1+\alpha)/(2+\alpha+\alpha\rho)}\bigr), \qquad n \to \infty,$

where $\sup_{P : \text{(CAD)}, \text{(MA)}}$ denotes the supremum over all joint
distributions $P$ of $(X,Y)$ satisfying Assumptions (CAD) and (MA). The RHS of (1.9)
can be arbitrarily close to $O(n^{-1})$ for large $\alpha$ and small $\rho$. Result
(1.9) for direct ERM classifiers on $\varepsilon$-nets is proved by Tsybakov [20], and
for some other ERM type classifiers by Tsybakov and van de Geer [22], Koltchinskii [11]
and Audibert [1] [in some of these papers the rate of convergence (1.9) is
obtained with an extra log-factor].

Comparison of (1.6) and (1.9) with (1.4) seems to support the conjecture
that the plug-in classifiers are inferior to the ERM type ones. The main
message of the present paper is to disprove this. We will show that there
exist plug-in rules converging with fast rates, and even with super-fast rates,
that is, faster than $n^{-1}$ under the margin Assumption (MA). The basic idea
of the proof is to use arguments similar to (1.8) combined with exponential
inequalities for the regression estimator $\hat\eta_n$ (see Section 3 below) or the
convergence results in the $L_\infty$ norm (see Section 5), rather than the usual
$L_1$ or $L_2$ norm convergence of $\hat\eta_n$ as previously described [cf. (1.3)].
On the other hand, the super-fast rates are not attainable for ERM type rules or,
more precisely, under Assumption (CAD), which serves for the study of ERM
type rules. In fact, the lower bound of [15] shows that the rates cannot be
faster than $(\log n)/n$ even for smaller classes than those satisfying (CAD).

It is important to note that our results on fast rates cover more gen- 
eral settings than just classification with plug-in rules. These are rather 
results about classification in the regression complexity context under the 
margin assumption. In particular, we establish minimax lower bounds valid 
for all classifiers, and we construct a "hybrid" plug-in/ERM procedure (i.e., 
a procedure performing ERM on a set of plug-in rules coming from an ap- 
propriate grid on the set of regression functions) that achieves optimality. 
Thus, the point is mainly not about the type of procedure (plug-in or ERM)
but about the type of complexity assumption [on the regression function 
(CAR) or on the decision set (CAD)] that should be natural to impose. 
Assumption (CAR) on the regression function arises in a natural way in 
the analysis of several practical procedures of plug-in type, such as various 
versions of boosting or SVM (cf. [3, 5, 6, 17, 19]). These procedures are now 
being intensively studied, but, to our knowledge, only suboptimal rates of 
convergence have been proved in the regression complexity context under 
the margin assumption. The results in Section 4 (see also Section 5) estab- 
lish the optimal rates of classification under Assumption (CAR) and show 
that they are attained for a "hybrid" plug-in/ERM procedure. Expectedly, 
the same rates should be achievable for other plug-in type procedures, such 
as boosting. 

2. Notation and definitions. In this section we introduce some notation, 
definitions and basic facts that will be used in the paper. 

We denote by $C, C_1, C_2, \ldots$ positive constants whose values may differ
from line to line. The symbols $\mathbf{P}$ and $\mathbf{E}$ stand for generic
probability and expectation, and $\mathbf{E}_X$ is the expectation w.r.t. the
marginal distribution $P_X$. We denote by $B(x,r)$ the closed Euclidean ball in
$\mathbb{R}^d$ centered at $x \in \mathbb{R}^d$ and of radius $r > 0$.

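For any multi-index $s = (s_1, \ldots, s_d) \in \mathbb{N}^d$ and any
$x = (x_1, \ldots, x_d) \in \mathbb{R}^d$, we define $|s| = \sum_{i=1}^d s_i$,
$s! = s_1! \cdots s_d!$, $x^s = x_1^{s_1} \cdots x_d^{s_d}$ and
$\|x\| = (x_1^2 + \cdots + x_d^2)^{1/2}$. Let $D^s$ denote the differential operator

$$D^s = \frac{\partial^{s_1 + \cdots + s_d}}{\partial x_1^{s_1} \cdots \partial x_d^{s_d}}.$$

Let $\beta > 0$. Denote by $\lfloor\beta\rfloor$ the maximal integer that is strictly
less than $\beta$. For any $x \in \mathbb{R}^d$ and any $\lfloor\beta\rfloor$-times
continuously differentiable real-valued function $g$ on $\mathbb{R}^d$, we denote by
$g_x$ its Taylor polynomial of degree $\lfloor\beta\rfloor$ at point $x$,

$$g_x(x') = \sum_{|s| \le \lfloor\beta\rfloor} \frac{(x' - x)^s}{s!} D^s g(x).$$

Let $L > 0$. The $(\beta, L, \mathbb{R}^d)$-Hölder class of functions, denoted
$\Sigma(\beta, L, \mathbb{R}^d)$, is defined as the set of functions
$g : \mathbb{R}^d \to \mathbb{R}$ that are $\lfloor\beta\rfloor$ times continuously
differentiable and satisfy, for any $x, x' \in \mathbb{R}^d$, the inequality

$$|g(x') - g_x(x')| \le L \|x - x'\|^{\beta}.$$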

Fix some constants $c_0, r_0 > 0$. We will say that a Lebesgue measurable set
$A \subseteq \mathbb{R}^d$ is $(c_0, r_0)$-regular if

(2.1)  $\lambda[A \cap B(x,r)] \ge c_0 \lambda[B(x,r)], \qquad \forall\, 0 < r \le r_0, \ \forall x \in A,$

where $\lambda[S]$ stands for the Lebesgue measure of $S \subset \mathbb{R}^d$. To
illustrate this definition, consider the following example. Let $d \ge 2$. Then the
set $A = \{x = (x_1, \ldots, x_d) \in \mathbb{R}^d : \sum_{j=1}^d |x_j|^q \le 1\}$ is
$(c_0, r_0)$-regular with some $c_0, r_0 > 0$ for $q \ge 1$, and there are no
$c_0, r_0 > 0$ such that $A$ is $(c_0, r_0)$-regular for $0 < q < 1$.

Introduce now two assumptions on the marginal distribution Px that will 
be used in the sequel. 

Definition 2.1. Fix $0 < c_0, r_0, \mu_{\max} < \infty$ and a compact
$\mathcal{C} \subset \mathbb{R}^d$. We say that the mild density assumption is
satisfied if the marginal distribution $P_X$ is supported on a compact
$(c_0, r_0)$-regular set $A \subseteq \mathcal{C}$ and has a uniformly bounded density
$\mu$ w.r.t. the Lebesgue measure: $\mu(x) \le \mu_{\max}$, $\forall x \in A$.

Definition 2.2. Fix some constants $c_0, r_0 > 0$ and
$0 < \mu_{\min} < \mu_{\max} < \infty$ and a compact $\mathcal{C} \subset \mathbb{R}^d$.
We say that the strong density assumption is satisfied if the marginal distribution
$P_X$ is supported on a compact $(c_0, r_0)$-regular set $A \subseteq \mathcal{C}$ and
has a density $\mu$ w.r.t. the Lebesgue measure bounded away from zero and infinity
on $A$:

$$\mu_{\min} \le \mu(x) \le \mu_{\max} \ \text{ for } x \in A \quad \text{and} \quad \mu(x) = 0 \ \text{ otherwise}.$$

We finally recall some notions related to local polynomial estimators. 

Definition 2.3. For $h > 0$, $x \in \mathbb{R}^d$, for an integer $l \ge 0$ and a
function $K : \mathbb{R}^d \to \mathbb{R}_+$, denote by $\hat\theta_x$ a polynomial on
$\mathbb{R}^d$ of degree $l$ which minimizes

(2.2)  $\sum_{i=1}^n \bigl[Y_i - \hat\theta_x(X_i - x)\bigr]^2 K\Bigl(\frac{X_i - x}{h}\Bigr).$

The local polynomial estimator $\hat\eta_n^{LP}(x)$ of order $l$, or LP($l$) estimator,
of the value $\eta(x)$ of the regression function at point $x$ is defined by
$\hat\eta_n^{LP}(x) = \hat\theta_x(0)$ if $\hat\theta_x$ is the unique minimizer of (2.2)
and $\hat\eta_n^{LP}(x) = 0$ otherwise. The value $h$ is called the bandwidth and the
function $K$ is called the kernel of the LP($l$) estimator.

Let $T_s$ denote the coefficients of $\hat\theta_x$ indexed by the multi-index
$s \in \mathbb{N}^d$, $\hat\theta_x(u) = \sum_{|s| \le l} T_s u^s$. Introduce the vectors
$T = (T_s)_{|s| \le l}$, $V = (V_s)_{|s| \le l}$, where

(2.3)  $V_s = \sum_{i=1}^n Y_i (X_i - x)^s K\Bigl(\frac{X_i - x}{h}\Bigr),$

$U(u) = (u^s)_{|s| \le l}$ and the matrix $Q = (Q_{s_1, s_2})_{|s_1|, |s_2| \le l}$, where

(2.4)  $Q_{s_1, s_2} = \sum_{i=1}^n (X_i - x)^{s_1 + s_2} K\Bigl(\frac{X_i - x}{h}\Bigr).$

The following result is straightforward (cf. Section 1.7 in [21] where the case
$d = 1$ is considered).









Proposition 2.1. If the matrix $Q$ is positive definite, there exists a
unique polynomial on $\mathbb{R}^d$ of degree $l$ minimizing (2.2). Its vector of
coefficients is given by $T = Q^{-1} V$ and the corresponding LP($l$) regression
function estimator has the form

$$\hat\eta_n^{LP}(x) = U^T(0) Q^{-1} V = \sum_{i=1}^n Y_i K\Bigl(\frac{X_i - x}{h}\Bigr) U^T(0) Q^{-1} U(X_i - x).$$
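
As an illustration of Proposition 2.1, here is a minimal Python sketch of the LP($l$) estimator for scalar covariates ($d = 1$), where multi-indices reduce to integers $s = 0, \ldots, l$. The Epanechnikov kernel and the names `epanechnikov` and `local_polynomial` are assumptions made for this example only; the paper does not prescribe a particular kernel or implementation.

```python
import numpy as np


def epanechnikov(u):
    """Epanechnikov kernel on [-1, 1]; an illustrative choice of K."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)


def local_polynomial(x, X, Y, h, degree, kernel=epanechnikov):
    """LP(degree) estimate of eta(x) for d = 1.

    Builds V and Q as in (2.3)-(2.4) and returns U(0)^T Q^{-1} V, which is
    the constant coefficient of the fitted polynomial. Returns 0.0 when Q is
    not positive definite, following the convention of Definition 2.3.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    diff = X - x
    w = kernel(diff / h)
    # Column s holds (X_i - x)^s for s = 0, ..., degree.
    powers = np.vander(diff, N=degree + 1, increasing=True)
    Q = powers.T @ (w[:, None] * powers)
    V = powers.T @ (w * Y)
    try:
        np.linalg.cholesky(Q)  # succeeds only for positive definite Q
    except np.linalg.LinAlgError:
        return 0.0
    T = np.linalg.solve(Q, V)
    return float(T[0])  # U(0) = (1, 0, ..., 0), so U(0)^T T = T_0
```

Since a weighted least squares fit of degree $l$ reproduces polynomials of degree at most $l$, calling `local_polynomial` on noise-free quadratic data with `degree=2` returns the exact value of the quadratic at `x`, which gives a quick sanity check.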

3. Fast rates for plug-in rules under the strong density assumption. We
first state a general result showing how the rates of convergence of plug-in
classifiers can be deduced from exponential inequalities for the corresponding
regression estimators.

Lemma 3.1. Let $\hat\eta_n$ be an estimator of the regression function $\eta$ and
$\mathcal{P}$ a set of probability distributions on $\mathcal{Z}$ such that for some
constants $C_1 > 0$, $C_2 > 0$, for some positive sequence $a_n$, for $n \ge 1$ and
any $\delta > 0$, and for almost all $x$ w.r.t. $P_X$, we have

(3.1)  $\sup_{P \in \mathcal{P}} P^{\otimes n}\bigl(|\hat\eta_n(x) - \eta(x)| \ge \delta\bigr) \le C_1 \exp(-C_2 a_n \delta^2).$

Consider the plug-in classifier $\hat f_n = 1_{\{\hat\eta_n \ge 1/2\}}$. If all the
distributions $P \in \mathcal{P}$ satisfy the margin Assumption (MA), we have

$$\sup_{P \in \mathcal{P}} \{\mathbf{E} R(\hat f_n) - R(f^*)\} \le C a_n^{-(1+\alpha)/2}$$

for $n \ge 1$ with some constant $C > 0$ depending only on $\alpha$, $C_0$, $C_1$ and $C_2$.
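PROOF. Consider the sets $A_j \subset \mathbb{R}^d$, $j = 0, 1, 2, \ldots$, defined as

$$A_0 = \{x \in \mathbb{R}^d : 0 < |\eta(x) - \tfrac12| \le \delta\},$$
$$A_j = \{x \in \mathbb{R}^d : 2^{j-1}\delta < |\eta(x) - \tfrac12| \le 2^j \delta\} \quad \text{for } j \ge 1.$$

For any $\delta > 0$, we may write

(3.2)
$$
\begin{aligned}
\mathbf{E} R(\hat f_n) - R(f^*)
&= \mathbf{E}\bigl(|2\eta(X) - 1| \, 1_{\{\hat f_n(X) \ne f^*(X)\}}\bigr) \\
&= \sum_{j=0}^{\infty} \mathbf{E}\bigl(|2\eta(X) - 1| \, 1_{\{\hat f_n(X) \ne f^*(X)\}} \, 1_{\{X \in A_j\}}\bigr) \\
&\le 2\delta \, P_X\bigl(0 < |\eta(X) - \tfrac12| \le \delta\bigr)
 + \sum_{j \ge 1} \mathbf{E}\bigl(|2\eta(X) - 1| \, 1_{\{\hat f_n(X) \ne f^*(X)\}} \, 1_{\{X \in A_j\}}\bigr).
\end{aligned}
$$

On the event $\{\hat f_n \ne f^*\}$ we have $|\eta - \tfrac12| \le |\hat\eta_n - \eta|$.
So, for any $j \ge 1$, we get

$$
\begin{aligned}
\mathbf{E}\bigl(|2\eta(X) - 1| \, & 1_{\{\hat f_n(X) \ne f^*(X)\}} \, 1_{\{X \in A_j\}}\bigr) \\
&\le 2^{j+1}\delta \, \mathbf{E}\bigl[1_{\{|\hat\eta_n(X) - \eta(X)| \ge 2^{j-1}\delta\}} \, 1_{\{0 < |\eta(X) - 1/2| \le 2^j\delta\}}\bigr] \\
&\le 2^{j+1}\delta \, \mathbf{E}_X\bigl[P^{\otimes n}\bigl(|\hat\eta_n(X) - \eta(X)| \ge 2^{j-1}\delta\bigr) \, 1_{\{0 < |\eta(X) - 1/2| \le 2^j\delta\}}\bigr] \\
&\le C_1 2^{j+1}\delta \exp\bigl(-C_2 a_n (2^{j-1}\delta)^2\bigr) \, P_X\bigl(0 < |\eta(X) - \tfrac12| \le 2^j\delta\bigr) \\
&\le 2 C_1 C_0 2^{j(1+\alpha)} \delta^{1+\alpha} \exp\bigl(-C_2 a_n (2^{j-1}\delta)^2\bigr),
\end{aligned}
$$

where in the last inequality we have used Assumption (MA). Now, from
inequality (3.2), taking $\delta = a_n^{-1/2}$ and using Assumption (MA) to bound the
first term of the right-hand side of (3.2), we get

$$
\begin{aligned}
\mathbf{E} R(\hat f_n) - R(f^*)
&\le 2 C_0 a_n^{-(1+\alpha)/2} + C a_n^{-(1+\alpha)/2} \sum_{j \ge 1} 2^{j(1+\alpha)} \exp\bigl(-C_2 2^{2j-2}\bigr) \\
&\le C a_n^{-(1+\alpha)/2}. \qquad \Box
\end{aligned}
$$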
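
The last step of the proof relies on the series $\sum_{j \ge 1} 2^{j(1+\alpha)} \exp(-C_2 2^{2j-2})$ being finite: the doubly exponential decay of the second factor dominates the geometric growth of the first. The short Python sketch below evaluates partial sums numerically; the function name `dyadic_tail_sum` and the parameter values in the usage note are illustrative assumptions, not quantities from the paper.

```python
import math


def dyadic_tail_sum(alpha: float, c2: float, terms: int = 60) -> float:
    """Partial sum over j = 1..terms of 2^{j(1+alpha)} * exp(-c2 * 2^{2j-2}).

    Each term is evaluated in log-space so that the large power of two and
    the tiny exponential are combined before calling exp.
    """
    total = 0.0
    for j in range(1, terms + 1):
        log_term = j * (1.0 + alpha) * math.log(2.0) - c2 * 4.0 ** (j - 1)
        total += math.exp(log_term)
    return total
```

For instance, with `alpha = 1.0` and `c2 = 1.0` the partial sums stop changing after a handful of terms, so the series contributes only a constant factor depending on $\alpha$ and $C_2$.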

Inequality (3.1) is crucial to obtain the above result. This inequality holds
for various types of estimators and various sets of probability distributions
$\mathcal{P}$. Here we focus on the standard case where $\eta$ belongs to the Hölder
class $\Sigma(\beta, L, \mathbb{R}^d)$ and the marginal law of $X$ satisfies the strong
density assumption. We are going to show that in this case there exist estimators
satisfying inequality (3.1) with $a_n = n^{2\beta/(2\beta+d)}$. These can be, for
example, locally polynomial estimators. Specifically, assume from now on that $K$ is
a kernel satisfying

(3.3)  $\exists c > 0 : \ K(x) \ge c \, 1_{\{\|x\| \le c\}}, \qquad \forall x \in \mathbb{R}^d,$

(3.4)  $\int_{\mathbb{R}^d} K(u) \, du = 1,$

(3.5)  $\int_{\mathbb{R}^d} (1 + \|u\|^{4\beta}) K^2(u) \, du < \infty,$

(3.6)  $\sup_{u \in \mathbb{R}^d} (1 + \|u\|^{2\beta}) K(u) < \infty.$

Let $h > 0$, and consider the matrix
$\bar B = (\bar B_{s_1, s_2})_{|s_1|, |s_2| \le \lfloor\beta\rfloor}$, where
$\bar B_{s_1, s_2} = \frac{1}{n h^d} \sum_{i=1}^n \bigl(\frac{X_i - x}{h}\bigr)^{s_1 + s_2} K\bigl(\frac{X_i - x}{h}\bigr)$.
Define the regression function estimator $\hat\eta_n^*$ as follows. If the smallest
eigenvalue of the matrix $\bar B$ is greater than $(\log n)^{-1}$ we set
$\hat\eta_n^*(x)$ equal to the projection of $\hat\eta_n^{LP}(x)$ on the interval
$[0, 1]$, where $\hat\eta_n^{LP}(x)$ is the LP($\lfloor\beta\rfloor$) estimator with
bandwidth $h > 0$ and kernel $K$ satisfying (3.3)-(3.6). If the smallest eigenvalue
of $\bar B$ is less than $(\log n)^{-1}$ we set $\hat\eta_n^*(x) = 0$.
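
The construction above can be sketched in Python for scalar covariates ($d = 1$). The sketch works with the rescaled design $(X_i - x)/h$: since $Q = n h \, D \bar B D$ and $V = n h \, D \bar V$ with $D = \mathrm{diag}(h^s)$ and $\bar V_s = \frac{1}{nh} \sum_i Y_i \bigl(\frac{X_i - x}{h}\bigr)^s K\bigl(\frac{X_i - x}{h}\bigr)$, the constant coefficient of $\bar B^{-1} \bar V$ coincides with that of $Q^{-1} V$. The names `eta_star` and `epanechnikov`, the choice of kernel, and the treatment of the boundary case (eigenvalue exactly equal to $(\log n)^{-1}$, mapped to 0 here) are assumptions made for the illustration.

```python
import math

import numpy as np


def epanechnikov(u):
    """Epanechnikov kernel on [-1, 1]; an illustrative kernel choice."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)


def eta_star(x, X, Y, h, beta, kernel=epanechnikov):
    """Safeguarded LP estimate of eta(x) for d = 1, clipped to [0, 1]."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.size
    if n < 2:
        return 0.0  # log(n) is not positive, so the eigenvalue test cannot pass
    # Largest integer strictly less than beta (the paper's convention).
    degree = math.ceil(beta) - 1
    u = (X - x) / h
    w = kernel(u)
    powers = np.vander(u, N=degree + 1, increasing=True)
    B_bar = powers.T @ (w[:, None] * powers) / (n * h)
    # eigvalsh returns eigenvalues in ascending order.
    if np.linalg.eigvalsh(B_bar)[0] <= 1.0 / math.log(n):
        return 0.0
    V_bar = powers.T @ (w * Y) / (n * h)
    coef = np.linalg.solve(B_bar, V_bar)
    return float(np.clip(coef[0], 0.0, 1.0))  # projection on [0, 1]
```

The eigenvalue test acts as a guard against nearly singular local designs: with few observations near `x` the function falls back to 0, while for larger samples it returns the clipped local polynomial fit.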
Theorem 3.2. Let $\mathcal{P}$ be a class of probability distributions $P$ on
$\mathcal{Z}$ such that the regression function $\eta$ belongs to the Hölder class
$\Sigma(\beta, L, \mathbb{R}^d)$ and the marginal law of $X$ satisfies the strong
density assumption. Then there exist constants $C_1, C_2, C_3 > 0$ such that for any
$0 < h \le r_0/c$, any $C_3 h^{\beta} < \delta$ and any $n \ge 1$ the estimator
$\hat\eta_n^*$ satisfies

(3.7)  $\sup_{P \in \mathcal{P}} P^{\otimes n}\bigl(|\hat\eta_n^*(x) - \eta(x)| \ge \delta\bigr) \le C_1 \exp(-C_2 n h^d \delta^2)$

for almost all $x$ w.r.t. $P_X$. As a consequence, there exist $C_1, C_2 > 0$ such
that for $h = n^{-1/(2\beta+d)}$ and any $\delta > 0$, $n \ge 1$ we have

(3.8)  $\sup_{P \in \mathcal{P}} P^{\otimes n}\bigl(|\hat\eta_n^*(x) - \eta(x)| \ge \delta\bigr) \le C_1 \exp\bigl(-C_2 n^{2\beta/(2\beta+d)} \delta^2\bigr)$

for almost all $x$ w.r.t. $P_X$. The constants $C_1, C_2, C_3$ depend only on $\beta$,
$d$, $L$, $c_0$, $r_0$, $\mu_{\min}$, $\mu_{\max}$, and on the kernel $K$.

The proof is given in Section 6.1. 

Remark 3.1. We have chosen here the LP estimators of $\eta$, because for
them the exponential inequality (3.1) holds without additional smoothness 
conditions on the marginal density of X. For other popular regression es- 
timators, such as kernel or orthogonal series ones, a similar inequality can 
also be proved if we assume that the marginal density of X is as smooth as 
the regression function. 

Definition 3.1. For a fixed parameter $\alpha \ge 0$, fixed positive parameters
$c_0$, $r_0$, $C_0$, $\beta$, $L$, $\mu_{\max} > \mu_{\min} > 0$ and a fixed compact
$\mathcal{C} \subset \mathbb{R}^d$, let $\mathcal{P}_\Sigma$ denote the class of all
probability distributions $P$ on $\mathcal{Z}$ such that:

(i) the margin Assumption (MA) is satisfied,

(ii) the regression function $\eta$ belongs to the Hölder class $\Sigma(\beta, L, \mathbb{R}^d)$,

(iii) the strong density assumption on $P_X$ is satisfied.

Lemma 3.1 and (3.8) immediately imply the next result. 

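Theorem 3.3. For any $n \ge 1$ the excess risk of the plug-in classifier
$\hat f_n^* = 1_{\{\hat\eta_n^* \ge 1/2\}}$ with bandwidth $h = n^{-1/(2\beta+d)}$ satisfies

$$\sup_{P \in \mathcal{P}_\Sigma} \{\mathbf{E} R(\hat f_n^*) - R(f^*)\} \le C n^{-\beta(1+\alpha)/(2\beta+d)},$$

where the constant $C > 0$ depends only on $\alpha$, $C_0$, $C_1$ and $C_2$.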

For $\alpha\beta > d/2$ the convergence rate $n^{-\beta(1+\alpha)/(2\beta+d)}$ obtained
in Theorem 3.3 is a fast rate, i.e., it is faster than $n^{-1/2}$. Furthermore, it is a
super-fast rate (i.e., is faster than $n^{-1}$) when $\beta(1+\alpha) > 2\beta + d$,
that is, for $(\alpha - 1)\beta > d$. We must note that if this
condition is satisfied, the class $\mathcal{P}_\Sigma$ is rather poor, and thus
super-fast rates can occur only for very particular joint distributions of $(X,Y)$.
Intuitively this is clear. Indeed, to have a very smooth regression function $\eta$
(i.e., very large $\beta$) implies that when $\eta$ hits the level 1/2, it cannot
"take off" from this level too abruptly. As a consequence, when the density of the
distribution $P_X$ is bounded away from 0 in the vicinity of the hitting point, the
margin assumption cannot be satisfied for large $\alpha$ since this assumption puts an
upper bound on the "time spent" by the regression function near 1/2. So,
$\alpha$ and $\beta$ cannot be simultaneously very large. It can be shown that the case
of simultaneously large $\alpha$ and $\beta$ is essentially described by the condition
$\alpha\beta > d$.
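
The regimes of the rate in Theorem 3.3 can be read off from the exponent $r = \beta(1+\alpha)/(2\beta+d)$: solving $r > 1/2$ gives $\alpha\beta > d/2$, and solving $r > 1$ gives $\beta(1+\alpha) > 2\beta + d$, i.e. $(\alpha-1)\beta > d$. The following minimal Python sketch classifies a parameter triple accordingly; the function names `plug_in_exponent` and `regime` are ours and purely illustrative.

```python
def plug_in_exponent(alpha: float, beta: float, d: int) -> float:
    """Exponent r of the rate n^{-r} in Theorem 3.3."""
    return beta * (1.0 + alpha) / (2.0 * beta + d)


def regime(alpha: float, beta: float, d: int) -> str:
    """Label the rate n^{-r} as slow (r <= 1/2), fast or super-fast (r > 1)."""
    r = plug_in_exponent(alpha, beta, d)
    if r > 1.0:
        return "super-fast"
    if r > 0.5:
        return "fast"
    return "slow"
```

For example, `regime(3.0, 2.0, 1)` gives a super-fast rate (exponent 1.6), whereas `regime(2.0, 1.0, 1)` sits exactly at exponent 1 and is therefore only fast.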

To be more precise, observe first that $\mathcal{P}_\Sigma$ is not empty for
$\alpha\beta > d$, so that super-fast rates can effectively occur. Examples of laws
$P \in \mathcal{P}_\Sigma$ under this condition can be easily given, such as the one
with $P_X$ equal to the uniform distribution on a ball centered at 0 in $\mathbb{R}^d$,
and the regression function defined by $\eta(x) = 1/2 - C\|x\|^2$ with an appropriate
$C > 0$. Clearly, $\eta$ belongs to Hölder classes with arbitrarily large $\beta$ and
Assumption (MA) is satisfied with $\alpha = d/2$. Thus, for $d \ge 3$ and $\beta$ large
enough super-fast rates can occur. Note that in this example the decision set
$\{x : \eta(x) \ge 1/2\}$ has Lebesgue measure 0 in $\mathbb{R}^d$. It turns out that
such a condition is necessary to achieve classification with super-fast rates when
the Hölder classes of regression functions are considered.
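
The margin exponent $\alpha = d/2$ in this example can be checked directly: for $P_X$ uniform on the ball of radius $R$ centered at 0, $P_X(0 < |\eta(X) - 1/2| \le t) = P_X(\|X\| \le \sqrt{t/C}) = \min\{1, (t/C)^{d/2} / R^d\}$. The Python sketch below compares this closed form with a Monte Carlo estimate. The function names, the sample size and the seed are illustrative assumptions.

```python
import numpy as np


def sample_uniform_ball(n, d, radius, rng):
    """Draw n points uniformly from the ball of given radius in R^d."""
    g = rng.standard_normal((n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)  # uniform directions
    r = radius * rng.random(n) ** (1.0 / d)        # radial law of a uniform ball
    return g * r[:, None]


def margin_mass_exact(t, d, C, radius):
    """P_X(0 < |eta(X) - 1/2| <= t) for eta(x) = 1/2 - C * ||x||^2."""
    return min(1.0, (t / C) ** (d / 2.0) / radius**d)


def margin_mass_mc(t, d, C, radius, n=200_000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = np.random.default_rng(seed)
    x = sample_uniform_ball(n, d, radius, rng)
    gap = C * np.sum(x**2, axis=1)  # equals |eta(x) - 1/2|
    return float(np.mean((gap > 0) & (gap <= t)))
```

With $d = 3$, $C = 1/2$ and $R = 1$ (so that $\eta$ takes values in $[0, 1/2]$), the closed form at $t = 1/8$ is $(1/4)^{3/2} = 1/8$, and the simulated frequency should land close to that value.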

To explain this we need a definition. We will say that $\eta$ crosses the level
1/2 at a point $x_0 \in \mathbb{R}^d$ if for any $r > 0$ there exist $x_-$ and $x_+$
in $B(x_0, r)$ such that $\eta(x_-) < 1/2$ and $\eta(x_+) > 1/2$.

Proposition 3.4. If $\alpha(1 \wedge \beta) > 1$ there is no distribution
$P \in \mathcal{P}_\Sigma$ such that the regression function $\eta$ associated with $P$
crosses 1/2 in the interior of the support of $P_X$.

Proof of this proposition is given in [2] . 

Note that the condition $\alpha(1 \wedge \beta) > 1$ appearing in Proposition 3.4 is
equivalent to $\frac{\beta(1+\alpha)}{2\beta+d} > \frac{(2\beta) \vee (\beta+1)}{2\beta+d}$,
which is necessary to have super-fast rates. As
a simple consequence, in this context, super-fast rates cannot occur when
the regression function crosses 1/2 in the interior of the support.

The following lower bound shows optimality of the rate of convergence
for the Hölder classes obtained in Theorem 3.3.

Theorem 3.5. Let $d \ge 1$ be an integer, and let $L, \beta, \alpha$ be positive
constants such that $\alpha\beta \le d$. Then there exists a constant $C > 0$ such that
for any $n \ge 1$ and any classifier $\hat f_n : \mathcal{Z}^n \to \mathcal{F}$, we have

$$\sup_{P \in \mathcal{P}_\Sigma} \{\mathbf{E} R(\hat f_n) - R(f^*)\} \ge C n^{-\beta(1+\alpha)/(2\beta+d)}.$$

The proof is given in Section 6.2.

Note that the lower bound of Theorem 3.5 does not cover the case
$\alpha\beta > d$, which contains the super-fast regime.

Finally, we discuss the case where "$\alpha = \infty$," which means that there exists
$t_0 > 0$ such that

(3.9)  $P_X(0 < |\eta(X) - 1/2| \le t_0) = 0.$

This is a very favorable situation for classification. The rates of convergence
of the ERM type classifiers under (3.9) are, of course, faster than under
Assumption (MA) with $\alpha < \infty$ (cf. [15]), but they are not faster than $n^{-1}$.
Indeed, Massart and Nédélec [15] provide a lower bound showing that, even
if Assumption (CAD) is replaced by the very strong assumption that the
true decision set belongs to a VC-class (note that both assumptions are
naturally linked to the study of the ERM type classifiers), the best achievable
rate is of order $(\log n)/n$. We show now that for the plug-in classifiers much
faster rates can be attained. Specifically, if the regression function $\eta$ has
some (arbitrarily low) Hölder smoothness $\beta$ and (3.9) holds, the rate of
convergence is exponential in $n$. To show this, we first state a simple lemma
which is valid for any plug-in classifier $\hat f_n$.

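Lemma 3.6. Let assumption (3.9) be satisfied, and let $\hat\eta_n$ be an estimator
of the regression function $\eta$. Then for the plug-in classifier
$\hat f_n = 1_{\{\hat\eta_n \ge 1/2\}}$ we have

$$\mathbf{E} R(\hat f_n) - R(f^*) \le \mathbf{P}\bigl(|\hat\eta_n(X) - \eta(X)| \ge t_0\bigr).$$

Proof. Following an argument similar to the proof of Lemma 3.1 and
using condition (3.9), we get

$$
\begin{aligned}
\mathbf{E} R(\hat f_n) - R(f^*)
&\le 2 t_0 \, P_X(0 < |\eta(X) - 1/2| \le t_0)
 + \mathbf{E}\bigl(|2\eta(X) - 1| \, 1_{\{\hat f_n(X) \ne f^*(X)\}} \, 1_{\{|\eta(X) - 1/2| > t_0\}}\bigr) \\
&= \mathbf{E}\bigl(|2\eta(X) - 1| \, 1_{\{\hat f_n(X) \ne f^*(X)\}} \, 1_{\{|\eta(X) - 1/2| > t_0\}}\bigr) \\
&\le \mathbf{P}\bigl(|\hat\eta_n(X) - \eta(X)| \ge t_0\bigr). \qquad \Box
\end{aligned}
$$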

Lemma 3.6 and Theorem 3.2 immediately imply that, under assumption
(3.9), the rate of convergence of the plug-in classifier
$\hat f_n^* = 1_{\{\hat\eta_n^* \ge 1/2\}}$ with a small enough fixed (independent of $n$)
bandwidth $h$ is exponential. To state the result, we denote by
$\mathcal{P}_{\Sigma,\infty}$ the class of probability distributions $P$ defined
in the same way as $\mathcal{P}_\Sigma$, with the only difference being that in
Definition 3.1 the margin Assumption (MA) is replaced by condition (3.9).

Proposition 3.7. There exists a fixed (independent of $n$) $h > 0$ such
that for any $n \ge 1$ the excess risk of the plug-in classifier
$\hat f_n^* = 1_{\{\hat\eta_n^* \ge 1/2\}}$ with bandwidth $h$ satisfies

$$\sup_{P \in \mathcal{P}_{\Sigma,\infty}} \{\mathbf{E} R(\hat f_n^*) - R(f^*)\} \le C_4 \exp(-C_5 n),$$

where the constants $C_4, C_5 > 0$ depend only on $t_0$, $\beta$, $d$, $L$, $c_0$, $r_0$,
$\mu_{\min}$, $\mu_{\max}$, and on the kernel $K$.

PROOF. Use Lemma 3.6, choose $h > 0$ such that $h < \min(r_0/c, (t_0/C_3)^{1/\beta})$
and apply (3.7) with $\delta = t_0$. $\Box$

Koltchinskii and Beznosova [12] prove a result on exponential rates for
the plug-in classifier with a penalized regression estimator in place of the
local polynomial one that we use here. Their result is stated under a less
general condition than Proposition 3.7, in the sense that they consider only
the Lipschitz class of regression functions $\eta$, while in Proposition 3.7 the
Hölder smoothness $\beta$ can be arbitrarily close to 0. Note also that we do not
impose any complexity assumption on the decision set. However, the class
of distributions $\mathcal{P}_{\Sigma,\infty}$ is quite restricted in a different
sense. Indeed, for such distributions condition (3.9) should be compatible with the
assumption that $\eta$ belongs to a Hölder class. A sufficient condition for this is
the existence of a band or a "corridor" of zero $P_X$-measure separating the sets
$\{x : \eta(x) > 1/2\}$ and $\{x : \eta(x) < 1/2\}$. We believe that this condition is
close to the necessary one.

4. Optimal learning rates without the strong density assumption. In
this section we show that if $P_X$ does not admit a density bounded away
from zero on its support the rates of classification are slower than those
obtained in Section 3. In particular, super-fast rates, that is, the rates faster
than $n^{-1}$, cannot be achieved. Introduce the following class of probability
distributions.

Definition 4.1. For a fixed parameter $\alpha \ge 0$, fixed positive parameters
$c_0$, $r_0$, $C_0$, $\beta$, $L$, $\mu_{\max} > 0$ and a fixed compact
$\mathcal{C} \subset \mathbb{R}^d$, let $\mathcal{P}'_\Sigma$ denote the class
of all probability distributions $P$ on $\mathcal{Z}$ such that:

(i) the margin Assumption (MA) is satisfied,

(ii) the regression function $\eta$ belongs to the Hölder class $\Sigma(\beta, L, \mathbb{R}^d)$,

(iii) the mild density assumption on $P_X$ is satisfied.

In this section we mainly assume that the distribution $P$ of $(X,Y)$ belongs
to $\mathcal{P}'_\Sigma$, but we also consider larger classes of distributions
satisfying the margin Assumption (MA) and the complexity Assumption (CAR).
Clearly, $\mathcal{P}_\Sigma \subset \mathcal{P}'_\Sigma$. The only difference between
$\mathcal{P}'_\Sigma$ and $\mathcal{P}_\Sigma$ is that for $\mathcal{P}'_\Sigma$
the marginal density of $X$ is not bounded away from zero. The optimal rates
for $\mathcal{P}'_\Sigma$ are slower than for $\mathcal{P}_\Sigma$. Indeed, we have the
following lower bound for the excess risk.



Theorem 4.1. Let $d \ge 1$ be an integer, and let $L, \beta, \alpha$ be positive
constants. Then there exists a constant $C > 0$ such that for any $n \ge 1$ and any
classifier $\hat f_n : \mathcal{Z}^n \to \mathcal{F}$ we have

$$\sup_{P \in \mathcal{P}'_\Sigma} \{\mathbf{E} R(\hat f_n) - R(f^*)\} \ge C n^{-(1+\alpha)\beta/((2+\alpha)\beta+d)}.$$

The proof is given in Section 6.2.

In particular, when $\alpha = d/\beta$, we get the slow convergence rate $1/\sqrt{n}$,
instead of the fast rate $n^{-(\beta+d)/(2\beta+d)}$ obtained in Theorem 3.3 under the
strong density assumption. Nevertheless, the lower bound can still approach
$n^{-1}$ as the margin parameter $\alpha$ tends to $\infty$.
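
The comparison between the two density assumptions can be verified by evaluating both exponents. The Python sketch below is a small illustration; the function names `exponent_strong` and `exponent_mild` are ours, and the numerical values in the usage note are just one example.

```python
def exponent_strong(alpha: float, beta: float, d: int) -> float:
    """Exponent of the rate in Theorem 3.3 (strong density assumption)."""
    return beta * (1.0 + alpha) / (2.0 * beta + d)


def exponent_mild(alpha: float, beta: float, d: int) -> float:
    """Exponent of the lower bound in Theorem 4.1 (mild density assumption)."""
    return (1.0 + alpha) * beta / ((2.0 + alpha) * beta + d)
```

Taking $\beta = 1$, $d = 2$ and $\alpha = d/\beta = 2$ gives `exponent_mild` equal to 1/2 and `exponent_strong` equal to $(\beta + d)/(2\beta + d) = 3/4$. As $\alpha$ grows with $\beta$ and $d$ fixed, `exponent_mild` increases towards 1 but never exceeds it, in line with the absence of super-fast rates in this setting.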

We now show that the rate of convergence given in Theorem 4.1 is optimal
in the sense that there exist estimators that achieve this rate. This will be
obtained as a consequence of a general upper bound for the excess risk of
classifiers over a larger set $\mathcal{P}$ of distributions than $\mathcal{P}'_\Sigma$.

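Fix a Lebesgue measurable set $\mathcal{C} \subseteq \mathbb{R}^d$ and a value $1 \le p \le \infty$. Let $\Sigma$
be a class of regression functions $\eta$ on $\mathbb{R}^d$ such that Assumption (CAR) is
satisfied where the $\varepsilon$-entropy is taken w.r.t. the $L_p(\mathcal{C}, P_X)$ norm. Then for
every $\varepsilon > 0$ there exists an $\varepsilon$-net $\mathcal{N}_\varepsilon$ on $\Sigma$ w.r.t. this norm such that

$$\log(\operatorname{card} \mathcal{N}_\varepsilon) \le A' \varepsilon^{-\rho},$$

where $A'$ is a constant. Consider the empirical risk

$$R_n(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{f(X_i) \ne Y_i\}}$$

and set

$$\varepsilon_n = \varepsilon_n(\alpha, \rho, p) = \begin{cases} n^{-1/(2+\alpha+\rho)}, & \text{if } p = \infty, \\ n^{-(p+\alpha)/((2+\alpha)p+\rho(p+\alpha))}, & \text{if } 1 \le p < \infty. \end{cases}$$

Define a sieve estimator $\hat\eta_n^S$ of the regression function $\eta$ by the relation

$$\hat\eta_n^S \in \operatorname*{Arg\,min}_{\bar\eta \in \mathcal{N}_{\varepsilon_n}} R_n(f_{\bar\eta}), \tag{4.1}$$

where $f_{\bar\eta}(x) = \mathbb{1}_{\{\bar\eta(x) \ge 1/2\}}$, and consider the classifier $\hat f_n^S = \mathbb{1}_{\{\hat\eta_n^S \ge 1/2\}}$. Note
that $\hat f_n^S$ can be viewed as a "hybrid" plug-in/ERM procedure: the ERM is
performed on a set of plug-in rules corresponding to a grid on the class of
regression functions $\eta$.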
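
The hybrid plug-in/ERM procedure (4.1) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the net is passed in as a finite list of candidate regression functions (constructing an actual $\varepsilon$-net is not attempted), and the toy net and sample below are hypothetical.

```python
def plug_in(eta_bar):
    """Plug-in rule f(x) = 1{eta_bar(x) >= 1/2}."""
    return lambda x: 1 if eta_bar(x) >= 0.5 else 0

def empirical_risk(f, sample):
    """R_n(f) = (1/n) * #{i : f(X_i) != Y_i}."""
    return sum(f(x) != y for x, y in sample) / len(sample)

def sieve_classifier(net, sample):
    """ERM over the plug-in rules induced by a finite net of candidate
    regression functions, in the spirit of (4.1)."""
    best = min(net, key=lambda eta_bar: empirical_risk(plug_in(eta_bar), sample))
    return plug_in(best)

# Hypothetical toy net: step-shaped regression functions on [0, 1].
net = [lambda x, c=c: 0.9 if x >= c else 0.1 for c in (0.25, 0.5, 0.75)]
sample = [(0.1, 0), (0.3, 0), (0.45, 0), (0.55, 1), (0.7, 1), (0.9, 1)]

f_hat = sieve_classifier(net, sample)
print([f_hat(x) for x in (0.2, 0.6)])  # [0, 1]
```

The minimization runs over regression functions, but only through the induced classifiers; this is what makes the procedure an ERM on plug-in rules rather than a regression fit.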
Theorem 4.2. Let $\mathcal{P}$ be a set of probability distributions $P$ on $\mathcal{Z}$ such
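that:

(i) the margin Assumption (MA) is satisfied,

(ii) the regression function $\eta$ belongs to a class $\Sigma$ which satisfies the
complexity Assumption (CAR) with the $\varepsilon$-entropy taken w.r.t. the $L_p(\mathcal{C}, P_X)$
norm for some $1 \le p \le \infty$.

Consider the classifier $\hat f_n^S = \mathbb{1}_{\{\hat\eta_n^S \ge 1/2\}}$. Then for any $n \ge 1$ we have

$$\sup_{P \in \mathcal{P}} \{\mathbf{E} R(\hat f_n^S) - R(f^*)\} \le \begin{cases} C n^{-(1+\alpha)/(2+\alpha+\rho)}, & \text{if } p = \infty, \\ C n^{-(1+\alpha)p/((2+\alpha)p+\rho(p+\alpha))}, & \text{if } 1 \le p < \infty. \end{cases} \tag{4.2}$$

The proof is given in Section 6.3.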

Theorem 4.2 allows one to get fast classification rates without any density
assumption on $P_X$. Namely, define the following class of distributions $P$ of
$(X, Y)$.

Definition 4.2. For fixed parameters $\alpha > 0$, $C_0 > 0$, $\beta > 0$, $L > 0$, and
for a fixed compact $\mathcal{C} \subset \mathbb{R}^d$, let $\mathcal{P}^0_\Sigma$ denote the class of all probability distributions $P$ on $\mathcal{Z}$ such that:

(i) the margin Assumption (MA) is satisfied,

(ii) the regression function $\eta$ belongs to the Hölder class $\Sigma(\beta, L, \mathbb{R}^d)$,

(iii) for all $P \in \mathcal{P}^0_\Sigma$ the supports of marginal distributions $P_X$ are included
in $\mathcal{C}$.

If $\mathcal{C}$ is a compact, the estimates of $\varepsilon$-entropies of Hölder classes $\Sigma(\beta, L, \mathbb{R}^d)$
in the $L_\infty(\mathcal{C}, \lambda)$ norm, where $\lambda$ is the Lebesgue measure on $\mathbb{R}^d$, are obtained
by Kolmogorov and Tihomirov [10], and they yield Assumption (CAR) with
$\rho = d/\beta$. Therefore, from (4.2) with $p = \infty$ we easily get the following upper
bound.

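Theorem 4.3. Let $d \ge 1$ be an integer, and let $L, \beta$ and $\alpha$ be positive
constants. For any $n \ge 1$ the classifier $\hat f_n^S = \mathbb{1}_{\{\hat\eta_n^S \ge 1/2\}}$ defined by (4.1) with
$p = \infty$ satisfies

$$\sup_{P \in \mathcal{P}^0_\Sigma} \{\mathbf{E} R(\hat f_n^S) - R(f^*)\} \le C n^{-(1+\alpha)\beta/((2+\alpha)\beta+d)}$$

with some constant $C > 0$ depending only on $\alpha, \beta, d, L$ and $C_0$.

Since $\mathcal{P}'_\Sigma \subset \mathcal{P}^0_\Sigma$, the lower bound of Theorem 4.1 and the upper bound of Theorem 4.3 show that $n^{-(1+\alpha)\beta/((2+\alpha)\beta+d)}$ is the
optimal rate of convergence of the excess risk on the class of distributions
$\mathcal{P}^0_\Sigma$.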
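
The passage from (4.2) to Theorem 4.3 is the substitution $\rho = d/\beta$ in the case $p = \infty$. The sketch below checks this with exact fractions; the function name and the parameter values are illustrative assumptions.

```python
from fractions import Fraction

def rate_exponent(alpha, rho, p=None):
    """Exponent r in the bound C * n^{-r} of (4.2).
    p=None stands for the case p = infinity."""
    alpha, rho = Fraction(alpha), Fraction(rho)
    if p is None:
        return (1 + alpha) / (2 + alpha + rho)
    p = Fraction(p)
    return (1 + alpha) * p / ((2 + alpha) * p + rho * (p + alpha))

alpha, beta, d = Fraction(1), Fraction(2), Fraction(3)  # illustrative values
rho = d / beta                                          # Hölder class entropy exponent
r_inf = rate_exponent(alpha, rho)
print(r_inf)                                            # 4/9
print(r_inf == (1 + alpha) * beta / ((2 + alpha) * beta + d))  # True
```

For finite $p$ the exponent is smaller than in the case $p = \infty$ and increases towards it as $p$ grows.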
5. Comparison lemmas. In this section we give some useful inequalities
between the risks of plug-in classifiers and the $L_p$ risks of the corresponding regression estimators under the margin Assumption (MA). These inequalities will be helpful in the proofs. They also illustrate a connection
between the two complexity Assumptions (CAR) and (CAD) defined in the
Introduction and allow one to compare our study of plug-in estimators with
that given by Yang [26], who considered the case $\alpha = 0$ (no margin assumption), as well as with the developments in [3] and [6].

Throughout this section $\bar\eta$ is a Borel function on $\mathbb{R}^d$ and

$$\bar f(x) = \mathbb{1}_{\{\bar\eta(x) \ge 1/2\}}.$$

For $1 \le p \le \infty$ we denote by $\|\cdot\|_p$ the $L_p(\mathbb{R}^d, P_X)$ norm. We first state some
comparison inequalities for the $L_\infty$ norm.

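Lemma 5.1. For any distribution $P$ of $(X, Y)$ satisfying Assumption
(MA) we have

$$R(\bar f) - R(f^*) \le 2C_0 \|\bar\eta - \eta\|_\infty^{1+\alpha} \tag{5.1}$$

and

$$P_X(\bar f(X) \ne f^*(X),\ \eta(X) \ne 1/2) \le C_0 \|\bar\eta - \eta\|_\infty^{\alpha}. \tag{5.2}$$

Proof. To show (5.1) note that

\begin{align*}
R(\bar f) - R(f^*) &= \mathbf{E}\big(|2\eta(X) - 1|\, \mathbb{1}_{\{\bar f(X) \ne f^*(X)\}}\big) \\
&\le 2\mathbf{E}\big(|\eta(X) - \tfrac{1}{2}|\, \mathbb{1}_{\{0 < |\eta(X) - 1/2| \le |\eta(X) - \bar\eta(X)|\}}\big) \\
&\le 2\|\eta - \bar\eta\|_\infty P_X\big(0 < |\eta(X) - \tfrac{1}{2}| \le \|\eta - \bar\eta\|_\infty\big) \\
&\le 2C_0 \|\eta - \bar\eta\|_\infty^{1+\alpha}.
\end{align*}

Similarly,

\begin{align*}
P_X(\bar f(X) \ne f^*(X),\ \eta(X) \ne 1/2) &\le P_X\big(0 < |\eta(X) - \tfrac{1}{2}| \le |\eta(X) - \bar\eta(X)|\big) \\
&\le P_X\big(0 < |\eta(X) - \tfrac{1}{2}| \le \|\eta - \bar\eta\|_\infty\big) \\
&\le C_0 \|\eta - \bar\eta\|_\infty^{\alpha}. \qquad \square
\end{align*}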
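
Inequality (5.1) can be checked numerically on a toy model. The sketch below assumes $X$ uniform on $[0, 1]$ and $\eta(x) = x$, so that $P_X(0 < |\eta(X) - 1/2| \le t) \le 2t$, that is, (MA) holds with $\alpha = 1$ and $C_0 = 2$; the perturbation $\bar\eta(x) = x + \delta\cos(7x)$ is an arbitrary illustrative choice. The expectation is approximated by an average over a midpoint grid.

```python
import math

def excess_risk_and_bound(delta, n_grid=100_000, alpha=1.0, c0=2.0):
    """Return (approximate excess risk of the plug-in rule, right-hand side of (5.1))
    for the toy model X ~ Uniform[0, 1], eta(x) = x, eta_bar(x) = x + delta * cos(7x)."""
    xs = [(i + 0.5) / n_grid for i in range(n_grid)]   # midpoint grid on [0, 1]
    eta = lambda x: x
    eta_bar = lambda x: x + delta * math.cos(7 * x)
    # sup-norm distance between eta_bar and eta, evaluated on the grid
    sup_dist = max(abs(eta_bar(x) - eta(x)) for x in xs)
    # excess risk E(|2 eta(X) - 1| 1{f_bar(X) != f*(X)})
    excess = sum(abs(2 * eta(x) - 1) for x in xs
                 if (eta_bar(x) >= 0.5) != (eta(x) >= 0.5)) / n_grid
    bound = 2 * c0 * sup_dist ** (1 + alpha)
    return excess, bound

excess, bound = excess_risk_and_bound(0.1)
print(excess <= bound)  # True
```

Both quantities scale like $\delta^2$ here, which is the $(1+\alpha)$-th power of the sup-norm error predicted by (5.1) with $\alpha = 1$.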

Remark 5.1. Lemma 5.1 offers an easy way to obtain the result of Theorem 3.3 in a slightly less precise form, with an extra logarithmic factor in
the rate. In fact, under the strong density assumption, there exist nonparametric estimators $\hat\eta_n$ (e.g., suitably chosen local polynomial estimators) such
that

$$\mathbf{E}\big(\|\hat\eta_n - \eta\|_\infty^q\big) \le C \Big(\frac{\log n}{n}\Big)^{q\beta/(2\beta+d)} \qquad \forall q > 0,$$

uniformly over $\eta \in \Sigma(\beta, L, \mathbb{R}^d)$ (see, e.g., [18]). Taking here $q = 1 + \alpha$ and
applying the comparison inequality (5.1) we immediately get that the plug-in
classifier $\hat f_n = \mathbb{1}_{\{\hat\eta_n \ge 1/2\}}$ has excess risk $\mathcal{E}(\hat f_n)$ of the order $(n/\log n)^{-\beta(1+\alpha)/(2\beta+d)}$.

Another immediate application of Lemma 5.1 is to get lower bounds on
the risks of regression estimators in the $L_\infty$ norm from the corresponding
lower bounds on the excess risks of classifiers (cf. Theorems 3.5 and 4.1).
But here again we lose a logarithmic factor required for the best bounds.

Inequality (5.2) serves to compare the measure of symmetric difference
distance $d_\Delta$ between the decision sets with the $L_\infty$ distance between the corresponding regression functions. In fact, if $P_X(\eta(X) = 1/2) = 0$, inequality
(5.2) reads as $d_\Delta(\bar G, G^*) \le C_0 \|\bar\eta - \eta\|_\infty^{\alpha}$, where $\bar G = \{x : \bar f(x) = 1\}$. Thus,
the $d_\Delta$-convergence rates for estimation of the true decision set $G^*$ can be
obtained from the rates of the corresponding regression estimators.

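We now consider the comparison inequalities for $L_p$ norms with $1 \le p < \infty$.

Lemma 5.2. For any $1 \le p < \infty$ and any distribution $P$ of $(X, Y)$ satisfying Assumption (MA) with $\alpha > 0$ we have

$$R(\bar f) - R(f^*) \le C_1(\alpha, p)\, \|\bar\eta - \eta\|_p^{p(1+\alpha)/(p+\alpha)}, \tag{5.3}$$

where $C_1(\alpha, p) = 2(\alpha + p)\, p^{-1} (p/\alpha)^{\alpha/(\alpha+p)}\, C_0^{(p-1)/(\alpha+p)}$. In particular,

$$R(\bar f) - R(f^*) \le C_1(\alpha, 2) \Big( \int [\bar\eta(x) - \eta(x)]^2\, P_X(dx) \Big)^{(1+\alpha)/(2+\alpha)}. \tag{5.4}$$

Proof. For any $t > 0$ we have

\begin{align*}
R(\bar f) - R(f^*) &= \mathbf{E}\big[|2\eta(X) - 1|\, \mathbb{1}_{\{\bar f(X) \ne f^*(X)\}}\big] \\
&= 2\mathbf{E}\big[|\eta(X) - 1/2|\, \mathbb{1}_{\{\bar f(X) \ne f^*(X)\}} \mathbb{1}_{\{0 < |\eta(X) - 1/2| \le t\}}\big] \\
&\quad + 2\mathbf{E}\big[|\eta(X) - 1/2|\, \mathbb{1}_{\{\bar f(X) \ne f^*(X)\}} \mathbb{1}_{\{|\eta(X) - 1/2| > t\}}\big] \\
&\le 2\mathbf{E}\big[|\eta(X) - \bar\eta(X)|\, \mathbb{1}_{\{0 < |\eta(X) - 1/2| \le t\}}\big] \tag{5.5} \\
&\quad + 2\mathbf{E}\big[|\eta(X) - \bar\eta(X)|\, \mathbb{1}_{\{|\eta(X) - \bar\eta(X)| > t\}}\big] \\
&\le 2\|\eta - \bar\eta\|_p \big[P_X(0 < |\eta(X) - 1/2| \le t)\big]^{(p-1)/p} + \frac{2\|\eta - \bar\eta\|_p^p}{t^{p-1}}
\end{align*}

by the Hölder and Markov inequalities. So, for any $t > 0$, introducing $E = \|\eta - \bar\eta\|_p$ and using Assumption (MA) to bound the probability in (5.5), we
obtain

$$R(\bar f) - R(f^*) \le 2\Big( C_0^{(p-1)/p}\, t^{\alpha(p-1)/p} E + \frac{E^p}{t^{p-1}} \Big).$$

Minimizing over $t$ the RHS of this inequality we get (5.3). $\square$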
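
The last step of the proof, the minimization over $t$, can be verified numerically. The sketch below compares a grid minimum of the right-hand side with the closed form $C_1(\alpha, p) E^{p(1+\alpha)/(p+\alpha)}$ of (5.3); the grid search and the parameter values are illustrative assumptions.

```python
def c1(alpha, p, c0):
    """Constant C_1(alpha, p) of Lemma 5.2."""
    return (2 * (alpha + p) / p
            * (p / alpha) ** (alpha / (alpha + p))
            * c0 ** ((p - 1) / (alpha + p)))

def rhs(t, e, alpha, p, c0):
    """Bound 2 * (C_0^{(p-1)/p} t^{alpha (p-1)/p} E + E^p / t^{p-1}) before minimization."""
    return 2 * (c0 ** ((p - 1) / p) * t ** (alpha * (p - 1) / p) * e
                + e ** p / t ** (p - 1))

def minimized_rhs(e, alpha, p, c0, grid=100_000, t_max=5.0):
    """Crude grid minimization over t in (0, t_max]."""
    return min(rhs(t_max * (i + 1) / grid, e, alpha, p, c0) for i in range(grid))

alpha, p, c0, e = 1.0, 2.0, 2.0, 0.1      # illustrative values
closed_form = c1(alpha, p, c0) * e ** (p * (1 + alpha) / (p + alpha))
numeric = minimized_rhs(e, alpha, p, c0)
print(abs(numeric - closed_form) < 1e-6)  # True
```

The grid minimum can only overshoot the true minimum, so agreement up to a small tolerance confirms both the exponent and the constant in (5.3) for these values.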

If the regression function $\eta$ belongs to the Hölder class $\Sigma(\beta, L, \mathbb{R}^d)$ there
exist estimators $\hat\eta_n$ such that, uniformly over the class,

$$\mathbf{E}\{[\hat\eta_n(X) - \eta(X)]^2\} \le C n^{-2\beta/(2\beta+d)} \tag{5.6}$$

for some constant $C > 0$. This has been shown by Stone [18] under the
additional strong density assumption and by Yang [26] with no assumption on $P_X$. Using (5.6) and (5.4) we get that the excess risk of the corresponding plug-in classifier $\hat f_n = \mathbb{1}_{\{\hat\eta_n \ge 1/2\}}$ admits a bound of the order
$n^{-(2\beta/(2\beta+d))((1+\alpha)/(2+\alpha))}$, which is suboptimal when $\alpha \ne 0$ (cf. Theorems
4.2, 4.3). In other words, under the margin assumption, Lemma 5.2 is not
the right tool to analyze the convergence rate of plug-in classifiers. On the
contrary, when no margin assumption is imposed (i.e., $\alpha = 0$ in our notation), inequality (1.3), which is a version of (5.4) for $\alpha = 0$, is precise enough
to give the optimal rate of classification [26].
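
The suboptimality claim amounts to comparing two exponents. A small sketch with exact fractions (function names and the values $\beta = 1$, $d = 2$ are illustrative assumptions):

```python
from fractions import Fraction

def exponent_via_54(alpha, beta, d):
    """Exponent obtained by combining (5.6) with (5.4):
    (2 beta / (2 beta + d)) * ((1 + alpha) / (2 + alpha))."""
    a, b, d = Fraction(alpha), Fraction(beta), Fraction(d)
    return (2 * b / (2 * b + d)) * ((1 + a) / (2 + a))

def exponent_theorem_43(alpha, beta, d):
    """Exponent of Theorem 4.3: (1 + alpha) beta / ((2 + alpha) beta + d)."""
    a, b, d = Fraction(alpha), Fraction(beta), Fraction(d)
    return (1 + a) * b / ((2 + a) * b + d)

for alpha in (0, 1, 3):
    print(alpha, exponent_via_54(alpha, 1, 2), exponent_theorem_43(alpha, 1, 2))
# 0 1/4 1/4
# 1 1/3 2/5
# 3 2/5 4/7
```

The two exponents coincide at $\alpha = 0$ and the one derived from (5.4) is strictly smaller (a slower rate) for every $\alpha > 0$.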

Another way to obtain (5.4) is to use Bartlett, Jordan and McAuliffe [3]. It
is enough to apply their Theorem 10 with (in their notation) $\phi(t) = (1 - t)^2$,
$\psi(t) = t^2$, and to note that for this choice of $\phi$ we have $R_\phi(\bar\eta) - R_\phi^* = \|\eta - \bar\eta\|_2^2$.
Blanchard, Lugosi and Vayatis [6] used the result of Bartlett, Jordan and
McAuliffe [3] to prove fast rates of the order $n^{-2(1+\alpha)/(3(2+\alpha))}$ for a boosting
procedure over the class of regression functions $\eta$ of bounded variation in
dimension $d = 1$. Note that the same rates can be obtained for other plug-in
classifiers using (5.4). Indeed, if $\eta$ is of bounded variation, there exist estimators of $\eta$ converging with the mean squared $L_2$ rate $n^{-2/3}$ (cf. [9, 16, 23, 27]),
and thus application of (5.4) immediately yields the rate $n^{-2(1+\alpha)/(3(2+\alpha))}$
for the corresponding plug-in rule. However, Theorem 4.2 shows that this
is not an optimal rate [here again we observe that inequality (5.4) fails to
establish the optimal properties of plug-in classifiers]. In fact, let $d = 1$ and
let the assumptions of Theorem 4.2 be satisfied, where instead of assumption
(ii) we use a particular case: $\eta$ belongs to a class of functions on $[0, 1]$ whose
total variation is bounded by a constant $L < \infty$. It follows from [4] that
Assumption (CAR) for this class is satisfied with $\rho = 1$ for any $1 \le p < \infty$.
Hence, we can apply (4.2) of Theorem 4.2 to find that

$$\sup_{P \in \mathcal{P}} \{\mathbf{E} R(\hat f_n^S) - R(f^*)\} \le C n^{-(1+\alpha)p/((2+\alpha)p+(p+\alpha))} \tag{5.7}$$

for the corresponding class $\mathcal{P}$. If $p > 2$ (recall that the value $p \in [1, \infty)$ is
chosen by the statistician), the rate in (5.7) is faster than $n^{-2(1+\alpha)/(3(2+\alpha))}$
obtained under the same conditions by Blanchard, Lugosi and Vayatis [6].
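
The threshold $p > 2$ follows from comparing the exponent in (5.7) with $2(1+\alpha)/(3(2+\alpha))$: after cross-multiplying, the comparison reduces to $\alpha p > 2\alpha$. The sketch below confirms this with exact fractions (function names are illustrative assumptions).

```python
from fractions import Fraction

def sieve_exponent(alpha, p):
    """Exponent in (5.7), that is, (4.2) with rho = 1."""
    a, p = Fraction(alpha), Fraction(p)
    return (1 + a) * p / ((2 + a) * p + (p + a))

def boosting_exponent(alpha):
    """Exponent 2 (1 + alpha) / (3 (2 + alpha)) of the rate quoted from [6]."""
    a = Fraction(alpha)
    return 2 * (1 + a) / (3 * (2 + a))

alpha = 1
for p in (1, 2, 3, 10):
    print(p, sieve_exponent(alpha, p), sieve_exponent(alpha, p) > boosting_exponent(alpha))
# 1 2/5 False
# 2 4/9 False
# 3 6/13 True
# 10 20/41 True
```

At $p = 2$ the two exponents are equal, so the improvement is strict only for $p > 2$.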
6. Proofs.

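6.1. Proof of Theorem 3.2. Consider a distribution $P$ in $\mathcal{P}_\Sigma$. Let $A$
be the support of $P_X$. Fix $x \in A$ and $\delta > 0$. Consider the matrix $B = (B_{s_1,s_2})_{|s_1|,|s_2| \le \lfloor\beta\rfloor}$ with elements $B_{s_1,s_2} = \int_{\mathbb{R}^d} u^{s_1+s_2} K(u)\mu(x + hu)\,du$. The
smallest eigenvalue $\lambda_{\bar B}$ of $\bar B$ satisfies

\begin{align*}
\lambda_{\bar B} &= \min_{\|W\|=1} W^T \bar B W \\
&\ge \min_{\|W\|=1} W^T B W + \min_{\|W\|=1} W^T (\bar B - B) W \tag{6.1} \\
&\ge \min_{\|W\|=1} W^T B W - \sum_{|s_1|,|s_2| \le \lfloor\beta\rfloor} |\bar B_{s_1,s_2} - B_{s_1,s_2}|.
\end{align*}

Let $A_n = \{u \in \mathbb{R}^d : \|u\| \le c;\ x + hu \in A\}$ where $c$ is the constant appearing
in (3.3). Using (3.3), for any vector $W$ satisfying $\|W\| = 1$, we obtain

\begin{align*}
W^T B W &= \int_{\mathbb{R}^d} \Big( \sum_{|s| \le \lfloor\beta\rfloor} W_s u^s \Big)^2 K(u)\mu(x + hu)\,du \\
&\ge c\,\mu_{\min} \int_{A_n} \Big( \sum_{|s| \le \lfloor\beta\rfloor} W_s u^s \Big)^2 du.
\end{align*}

By assumption of the theorem, $ch \le r_0$. Since the support of the marginal
distribution is $(c_0, r_0)$-regular, we get

$$\lambda[A_n] \ge h^{-d} \lambda[B(x, ch) \cap A] \ge c_0 h^{-d} \lambda[B(x, ch)] \ge c_0 v_d c^d,$$

where $v_d = \lambda[B(0, 1)]$ is the volume of the unit ball and $c_0 > 0$ is the constant
introduced in the definition (2.1) of the $(c_0, r_0)$-regular set.

Let $\mathcal{A}$ denote the class of all compact subsets of $B(0, c)$ having Lebesgue
measure $c_0 v_d c^d$. Using the previous displays we obtain

$$\min_{\|W\|=1} W^T B W \ge c\,\mu_{\min} \min_{\|W\|=1;\ S \in \mathcal{A}} \int_S \Big( \sum_{|s| \le \lfloor\beta\rfloor} W_s u^s \Big)^2 du \triangleq 2\mu_0. \tag{6.2}$$

By the compactness argument, the minimum in (6.2) exists and is strictly
positive.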

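For $i = 1, \dots, n$ and any multi-indices $s_1, s_2$ such that $|s_1|, |s_2| \le \lfloor\beta\rfloor$, define

$$T_i = \frac{1}{h^d} \Big( \frac{X_i - x}{h} \Big)^{s_1+s_2} K\Big( \frac{X_i - x}{h} \Big) - \int_{\mathbb{R}^d} u^{s_1+s_2} K(u)\mu(x + hu)\,du.$$

We have $\mathbf{E} T_i = 0$, $|T_i| \le h^{-d} \sup_{u \in \mathbb{R}^d} (1 + \|u\|^{2\beta}) K(u) \triangleq \kappa_1 h^{-d}$ and the following bound for the variance of $T_i$:

\begin{align*}
\operatorname{Var} T_i &\le \frac{1}{h^{2d}} \mathbf{E} \Big( \frac{X_i - x}{h} \Big)^{2s_1+2s_2} K^2\Big( \frac{X_i - x}{h} \Big) \\
&= \frac{1}{h^d} \int_{\mathbb{R}^d} u^{2s_1+2s_2} K^2(u)\mu(x + hu)\,du \\
&\le \frac{\mu_{\max}}{h^d} \int_{\mathbb{R}^d} (1 + \|u\|^{4\beta}) K^2(u)\,du \triangleq \kappa_2 h^{-d}.
\end{align*}

Using Bernstein's inequality, for any $\varepsilon > 0$, we have

$$P^{\otimes n}\big( |\bar B_{s_1,s_2} - B_{s_1,s_2}| > \varepsilon \big) = P^{\otimes n}\Big( \Big| \frac{1}{n} \sum_{i=1}^n T_i \Big| > \varepsilon \Big) \le 2\exp\Big\{ -\frac{n h^d \varepsilon^2}{2\kappa_2 + 2\kappa_1 \varepsilon/3} \Big\}.$$

This and (6.1) and (6.2) imply that

$$P^{\otimes n}(\lambda_{\bar B} \le \mu_0) \le 2M^2 \exp(-C n h^d), \tag{6.3}$$

where $M^2$ is the number of elements of the matrix $\bar B$. Assume in what
follows that $n$ is large enough so that $\mu_0 > (\log n)^{-1}$. Then for $\lambda_{\bar B} > \mu_0$ we
have $|\hat\eta_n^*(x) - \eta(x)| \le |\hat\eta_n^{LP}(x) - \eta(x)|$. Therefore,

\begin{align*}
P^{\otimes n}\big( |\hat\eta_n^*(x) - \eta(x)| \ge \delta \big) &\le P^{\otimes n}(\lambda_{\bar B} \le \mu_0) \tag{6.4} \\
&\quad + P^{\otimes n}\big( |\hat\eta_n^{LP}(x) - \eta(x)| \ge \delta,\ \lambda_{\bar B} > \mu_0 \big).
\end{align*}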
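
Bernstein's inequality is used above in the form $2\exp\{-n\varepsilon^2/(2\sigma^2 + 2b\varepsilon/3)\}$ for averages of $n$ i.i.d. centered variables bounded by $b$ with variance at most $\sigma^2$; with $b = \kappa_1 h^{-d}$ and $\sigma^2 = \kappa_2 h^{-d}$ this gives the exponent $n h^d \varepsilon^2/(2\kappa_2 + 2\kappa_1\varepsilon/3)$. The sketch below checks the general form against an exact binomial tail; the Bernoulli example and all numeric values are illustrative assumptions, not quantities from the proof.

```python
import math

def bernstein_bound(n, eps, var, b):
    """2 exp(-n eps^2 / (2 var + 2 b eps / 3)) for the average of n i.i.d.
    centered variables bounded by b in absolute value, with variance <= var."""
    return 2 * math.exp(-n * eps ** 2 / (2 * var + 2 * b * eps / 3))

def kernel_form(n, h, d, eps, kappa1, kappa2):
    """The same bound with b = kappa1 * h^{-d} and var = kappa2 * h^{-d}."""
    return 2 * math.exp(-n * h ** d * eps ** 2 / (2 * kappa2 + 2 * kappa1 * eps / 3))

def exact_tail(n, q, eps):
    """P(|(1/n) sum_i (B_i - q)| >= eps) for i.i.d. Bernoulli(q) variables B_i.
    A small tolerance keeps boundary values of k in the tail event."""
    return sum(math.comb(n, k) * q ** k * (1 - q) ** (n - k)
               for k in range(n + 1) if abs(k - n * q) >= n * eps - 1e-9)

n, q, eps = 200, 0.3, 0.1
tail = exact_tail(n, q, eps)
bound = bernstein_bound(n, eps, var=q * (1 - q), b=max(q, 1 - q))
print(tail <= bound)  # True
```

The second function only rewrites the first one in the normalization of the proof; both return the same number when `var` and `b` are set as described.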



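We now evaluate the second probability on the right-hand side of (6.4). For
$\lambda_{\bar B} > \mu_0$ we have $\hat\eta_n^{LP}(x) = U^T(0) Q^{-1} V$ [where $V$ is given by (2.3)]. Introduce
the matrix $Z = (Z_{i,s})_{1 \le i \le n,\ |s| \le \lfloor\beta\rfloor}$ with elements

$$Z_{i,s} = (X_i - x)^s \sqrt{K\Big( \frac{X_i - x}{h} \Big)}.$$

The $s$th column of $Z$ is denoted by $Z_s$, and we introduce

$$Z^{(\eta)} \triangleq \sum_{|s| \le \lfloor\beta\rfloor} \frac{\eta^{(s)}(x)}{s!} Z_s.$$

Since $Q = Z^T Z$, we get

$$\forall |s| \le \lfloor\beta\rfloor : \quad U^T(0) Q^{-1} Z^T Z_s = \mathbb{1}_{\{s = (0, \dots, 0)\}},$$

hence $U^T(0) Q^{-1} Z^T Z^{(\eta)} = \eta(x)$. So we can write

$$\hat\eta_n^{LP}(x) - \eta(x) = U^T(0) Q^{-1} (V - Z^T Z^{(\eta)}) = U^T(0) \bar B^{-1} a,$$

where $a = \frac{1}{n h^d} H (V - Z^T Z^{(\eta)}) \in \mathbb{R}^M$ and $H$ is the diagonal matrix $H = (H_{s_1,s_2})_{|s_1|,|s_2| \le \lfloor\beta\rfloor}$ with $H_{s_1,s_2} = h^{-s_1} \mathbb{1}_{\{s_1 = s_2\}}$. For $\lambda_{\bar B} > \mu_0$ we get

$$|\hat\eta_n^{LP}(x) - \eta(x)| \le \|\bar B^{-1} a\| \le \lambda_{\bar B}^{-1} \|a\| \le \mu_0^{-1} \|a\| \le \mu_0^{-1} M \max_s |a_s|, \tag{6.5}$$

where $a_s$ are the components of the vector $a$ given by

$$a_s = \frac{1}{n h^d} \sum_{i=1}^n [Y_i - \eta_x(X_i)] \Big( \frac{X_i - x}{h} \Big)^s K\Big( \frac{X_i - x}{h} \Big).$$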
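Define

\begin{align*}
T_i^{(s,1)} &= \frac{1}{h^d} [Y_i - \eta(X_i)] \Big( \frac{X_i - x}{h} \Big)^s K\Big( \frac{X_i - x}{h} \Big), \\
T_i^{(s,2)} &= \frac{1}{h^d} [\eta(X_i) - \eta_x(X_i)] \Big( \frac{X_i - x}{h} \Big)^s K\Big( \frac{X_i - x}{h} \Big).
\end{align*}

We have

$$|a_s| \le \Big| \frac{1}{n} \sum_{i=1}^n T_i^{(s,1)} \Big| + \Big| \frac{1}{n} \sum_{i=1}^n \big[ T_i^{(s,2)} - \mathbf{E} T_i^{(s,2)} \big] \Big| + \big| \mathbf{E} T_i^{(s,2)} \big|. \tag{6.6}$$

Note that $\mathbf{E} T_i^{(s,1)} = 0$, $|T_i^{(s,1)}| \le \kappa_1 h^{-d}$ and

\begin{align*}
\operatorname{Var} T_i^{(s,1)} &\le 4^{-1} h^{-d} \int u^{2s} K^2(u)\mu(x + hu)\,du \le (\kappa_2/4) h^{-d}, \\
|T_i^{(s,2)} - \mathbf{E} T_i^{(s,2)}| &\le L\kappa_1 h^{\beta-d} + L\kappa_2 h^{\beta} \le C h^{\beta-d}, \\
\operatorname{Var} T_i^{(s,2)} &\le h^{-d} L^2 h^{2\beta} \int \|u\|^{2s+2\beta} K^2(u)\mu(x + hu)\,du \le L^2 \kappa_2 h^{2\beta-d}.
\end{align*}

Using Bernstein's inequality, for any $\varepsilon_1, \varepsilon_2 > 0$, we obtain

$$P^{\otimes n}\Big( \Big| \frac{1}{n} \sum_{i=1}^n T_i^{(s,1)} \Big| \ge \varepsilon_1 \Big) \le 2\exp\Big\{ -\frac{n h^d \varepsilon_1^2}{\kappa_2/2 + 2\kappa_1 \varepsilon_1/3} \Big\}$$

and

$$P^{\otimes n}\Big( \Big| \frac{1}{n} \sum_{i=1}^n \big[ T_i^{(s,2)} - \mathbf{E} T_i^{(s,2)} \big] \Big| \ge \varepsilon_2 \Big) \le 2\exp\Big\{ -\frac{n h^d \varepsilon_2^2}{2L^2 \kappa_2 h^{2\beta} + 2C h^{\beta} \varepsilon_2/3} \Big\}.$$

Since also

$$\big| \mathbf{E} T_i^{(s,2)} \big| \le L\kappa_2 h^{\beta},$$

we get, using (6.6), that if $3\mu_0^{-1} M L \kappa_2 h^{\beta} \le \delta \le 1$ the following inequality
holds:

\begin{align*}
P^{\otimes n}\Big( |a_s| \ge \frac{\mu_0 \delta}{M} \Big) &\le P^{\otimes n}\Big( \Big| \frac{1}{n} \sum_{i=1}^n T_i^{(s,1)} \Big| \ge \frac{\mu_0 \delta}{3M} \Big) + P^{\otimes n}\Big( \Big| \frac{1}{n} \sum_{i=1}^n \big[ T_i^{(s,2)} - \mathbf{E} T_i^{(s,2)} \big] \Big| \ge \frac{\mu_0 \delta}{3M} \Big) \\
&\le 4\exp(-C n h^d \delta^2).
\end{align*}

Combining this inequality with (6.3), (6.4) and (6.5), we obtain

$$P^{\otimes n}\big( |\hat\eta_n^*(x) - \eta(x)| \ge \delta \big) \le C_1 \exp(-C_2 n h^d \delta^2) \tag{6.7}$$

for $3\mu_0^{-1} M L \kappa_2 h^{\beta} \le \delta$ (for $\delta > 1$ inequality (6.7) is obvious since $\hat\eta_n^*$, $\eta$ take
values in $[0, 1]$). The constants $C_1, C_2$ in (6.7) do not depend on the distribution $P_X$, on its support $A$ and on the point $x \in A$, so that we get (3.7).
Now, (3.7) implies (3.8) for $C n^{-\beta/(2\beta+d)} \le \delta$, and thus for all $\delta > 0$ (with
possibly modified constants $C_1$ and $C_2$).

6.2. Proof of Theorems 3.5 and 4.1. The proof of both theorems is based
on Assouad's lemma (see, e.g., [13], Chapter 2 or [21], Chapter 2). We apply
it in a form adapted for the classification problem (Lemma 5.1 in [1]).

For an integer $q \ge 1$ we consider the regular grid on $\mathbb{R}^d$ defined as

$$G_q = \Big\{ \Big( \frac{2k_1 + 1}{2q}, \dots, \frac{2k_d + 1}{2q} \Big) : k_i \in \{0, \dots, q - 1\},\ i = 1, \dots, d \Big\}.$$

Let $n_q(x) \in G_q$ be the closest point to $x \in \mathbb{R}^d$ among points in $G_q$ [we
assume uniqueness of $n_q(x)$: if there exist several points in $G_q$ closest to
$x$ we define $n_q(x)$ as the one which is closest to 0]. Consider the partition
$\mathcal{X}'_1, \dots, \mathcal{X}'_{q^d}$ of $[0, 1]^d$ canonically defined using the grid $G_q$ [$x$ and $y$ belong to
the same subset if and only if $n_q(x) = n_q(y)$]. Fix an integer $m \le q^d$. For any
$i \in \{1, \dots, m\}$, we define $\mathcal{X}_i = \mathcal{X}'_i$ and $\mathcal{X}_0 = \mathbb{R}^d \setminus \bigcup_{i=1}^m \mathcal{X}_i$, so that $\mathcal{X}_0, \dots, \mathcal{X}_m$
form a partition of $\mathbb{R}^d$.

Let $u : \mathbb{R}_+ \to \mathbb{R}_+$ be a nonincreasing infinitely differentiable function such
that $u = 1$ on $[0, 1/4]$ and $u = 0$ on $[1/2, \infty)$. One can take, for example,
$u(x) = \big( \int_{1/4}^{1/2} u_1(t)\,dt \big)^{-1} \int_x^{\infty} u_1(t)\,dt$ where the infinitely differentiable function $u_1$ is defined as

$$u_1(x) = \begin{cases} \exp\Big\{ -\dfrac{1}{(1/2 - x)(x - 1/4)} \Big\}, & \text{for } x \in (1/4, 1/2), \\ 0, & \text{otherwise.} \end{cases}$$

Let $\phi : \mathbb{R}^d \to \mathbb{R}_+$ be the function defined as $\phi(x) = C_\phi u(\|x\|)$,
where the positive constant $C_\phi$ is taken small enough to ensure that $|\phi(x') - \phi_x(x')| \le L\|x' - x\|^{\beta}$ for any $x, x' \in \mathbb{R}^d$. Thus, $\phi \in \Sigma(\beta, L, \mathbb{R}^d)$.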
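
The grid and the projection $n_q$ used at the start of this proof can be sketched as follows. The sketch assumes that the grid points are the cell centres $((2k_1+1)/(2q), \dots, (2k_d+1)/(2q))$ with $k_i \in \{0, \dots, q-1\}$, and it implements the tie-breaking rule coordinate by coordinate (on a cell boundary, the lower centre is the one closer to 0). Function names are illustrative.

```python
import itertools
import math

def grid(q, d):
    """Centres of the q^d congruent cubes partitioning [0, 1]^d (assumed form of G_q)."""
    return [tuple((2 * k + 1) / (2 * q) for k in ks)
            for ks in itertools.product(range(q), repeat=d)]

def nearest_grid_point(x, q):
    """n_q(x): closest grid point to x, ties broken towards the origin."""
    point = []
    for xi in x:
        t = xi * q
        k = math.floor(t)
        if t == k and k >= 1:
            k -= 1                      # boundary point: take the centre closer to 0
        k = min(max(k, 0), q - 1)       # clamp for x outside [0, 1]^d
        point.append((2 * k + 1) / (2 * q))
    return tuple(point)

G = grid(4, 2)
print(len(G))                                # 16
print(nearest_grid_point((0.5, 0.3), 4))     # (0.375, 0.375)
```

Two points $x, y$ fall in the same cell of the partition exactly when `nearest_grid_point(x, q) == nearest_grid_point(y, q)`.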

Define the hypercube $\mathcal{H} = \{P_\sigma : \sigma = (\sigma_1, \dots, \sigma_m) \in \{-1, 1\}^m\}$ of probability distributions of $(X, Y)$ on $\mathcal{Z} = \mathbb{R}^d \times \{0, 1\}$ as follows.

For any $P_\sigma \in \mathcal{H}$ the marginal distribution of $X$ does not depend on $\sigma$,
and has a density $\mu$ w.r.t. the Lebesgue measure on $\mathbb{R}^d$ defined in the following way. Fix $0 < w \le m^{-1}$ and a set $A_0$ of positive Lebesgue measure
included in $\mathcal{X}_0$ (the particular choice of $A_0$ will be indicated later), and
take: (i) $\mu(x) = w/\lambda[B(0, (4q)^{-1})]$ if $x$ belongs to the ball $B(z, (4q)^{-1})$ for
some $z \in G_q$, (ii) $\mu(x) = (1 - mw)/\lambda[A_0]$ for $x \in A_0$, (iii) $\mu(x) = 0$ for all
other $x$.

Next, the distribution of $Y$ given $X$ for $P_\sigma \in \mathcal{H}$ is determined by the regression function $\eta_\sigma(x) = P(Y = 1 | X = x)$ that we define as $\eta_\sigma(x) = \frac{1 + \sigma_j \varphi(x)}{2}$
for any $x \in \mathcal{X}_j$, $j = 1, \dots, m$, and $\eta_\sigma \equiv 1/2$ on $\mathcal{X}_0$, where $\varphi(x) = q^{-\beta} \phi(q[x - n_q(x)])$. We will assume that $C_\phi \le 1$ to ensure that $\varphi$ and $\eta_\sigma$ take values in
$[0, 1]$.

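For any $s \in \mathbb{N}^d$ such that $|s| \le \lfloor\beta\rfloor$, the partial derivative $D^s \varphi$ exists and
$D^s \varphi(x) = q^{|s|-\beta} D^s \phi(q[x - n_q(x)])$. Therefore, for any $i \in \{1, \dots, m\}$ and any
$x, x' \in \mathcal{X}_i$, we have

$$|\varphi(x') - \varphi_x(x')| \le L\|x - x'\|^{\beta}.$$

This implies that for any $\sigma \in \{-1, 1\}^m$ the function $\eta_\sigma$ belongs to the Hölder
class $\Sigma(\beta, L, \mathbb{R}^d)$.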

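We now check the margin assumption. Set $x_0 = \big( \frac{1}{2q}, \dots, \frac{1}{2q} \big)$. For any $\sigma \in \{-1, 1\}^m$ we have

\begin{align*}
P_\sigma(0 < |\eta_\sigma(X) - 1/2| \le t) &= m P_\sigma\big( 0 < \phi(q[X - x_0]) \le 2tq^{\beta} \big) \\
&= m \int_{B(x_0, (4q)^{-1})} \mathbb{1}_{\{0 < \phi(q[x - x_0]) \le 2tq^{\beta}\}} \frac{w}{\lambda[B(0, (4q)^{-1})]}\,dx \\
&= \frac{mw}{\lambda[B(0, 1/4)]} \int_{B(0, 1/4)} \mathbb{1}_{\{\phi(x) \le 2tq^{\beta}\}}\,dx \\
&= mw\, \mathbb{1}_{\{t \ge C_\phi/(2q^{\beta})\}}.
\end{align*}

Therefore, the margin Assumption (MA) is satisfied if $mw = O(q^{-\alpha\beta})$.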
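According to Lemma 5.1 in [1], for any classifier $\hat f_n$ we have

$$\sup_{P \in \mathcal{H}} \{\mathbf{E} R(\hat f_n) - R(f^*)\} \ge mwb'(1 - b\sqrt{nw})/2, \tag{6.8}$$

where

\begin{align*}
b &\triangleq \Big[ 1 - \Big( \int_{\mathcal{X}_1} \sqrt{1 - \varphi^2(x)}\, \mu_1(x)\,dx \Big)^2 \Big]^{1/2} = C_\phi q^{-\beta}, \\
b' &\triangleq \int_{\mathcal{X}_1} \varphi(x) \mu_1(x)\,dx = C_\phi q^{-\beta}
\end{align*}

with $\mu_1(x) = \mu(x) / \int_{\mathcal{X}_1} \mu(z)\,dz$. Both equalities hold because $\varphi$ is constant, equal to $C_\phi q^{-\beta}$, on the ball $B(x_0, (4q)^{-1})$ that carries the mass of $\mu_1$.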

We now prove Theorem 3.5. Take $q = \lceil C n^{1/(2\beta+d)} \rceil$, $w = C' q^{-d}$ and $m = \lfloor C'' q^{d-\alpha\beta} \rfloor$ with some positive constants $C$, $C'$ and $C''$ to be chosen, and
set $A_0 = [0, 1]^d \setminus \bigcup_{i=1}^m \mathcal{X}_i$. The condition $\alpha\beta \le d$ ensures that the above
choice of $m$ is not degenerate: we have $m \ge 1$ for $C''$ large enough. We
now prove that $\mathcal{H} \subset \mathcal{P}_\Sigma$ under the appropriate choice of $C$, $C'$ and $C''$. In
fact, select these constants so that the triplet $(q, w, m)$ meets the conditions $m \le q^d$, $0 < w \le m^{-1}$ and $mw = O(q^{-\alpha\beta})$. Then, in view of the argument preceding (6.8), for any $\sigma \in \{-1, 1\}^m$ the regression function $\eta_\sigma$
belongs to $\Sigma(\beta, L, \mathbb{R}^d)$ and Assumption (MA) is satisfied. We now check
that $P_X$ obeys the strong density assumption. First, the density $\mu(x)$ equals
a positive constant for $x$ belonging to the union of balls $\bigcup_{i=1}^m B(z_i, (4q)^{-1})$,
where $z_i$ is the center of $\mathcal{X}_i$, and $\mu(x) = (1 - mw)/(1 - mq^{-d}) = 1 + o(1)$, as
$n \to \infty$, for $x \in A_0$. Thus, $\mu_{\min} \le \mu(x) \le \mu_{\max}$ for some positive $\mu_{\min}$ and
$\mu_{\max}$. [Note that this construction does not allow one to choose any prescribed values of $\mu_{\min}$ and $\mu_{\max}$, because $\mu(x) = 1 + o(1)$. The problem can
be fixed via a straightforward but cumbersome modification of the definition of $A_0$ that we skip here.] Second, the $(c_0, r_0)$-regularity of the support
$A$ of $P_X$ with some $c_0 > 0$ and $r_0 > 0$ follows from the fact that, by construction, $\lambda(A \cap B(x, r)) = (1 + o(1))\lambda([0, 1]^d \cap B(x, r))$ for all $x \in A$ and $r > 0$
(here again we skip the obvious generalization allowing to get any prescribed
$c_0 > 0$). Thus, the strong density assumption is satisfied, and we conclude
that $\mathcal{H} \subset \mathcal{P}_\Sigma$. Theorem 3.5 now follows from (6.8) if we choose $C'$ small
enough.

Finally, we prove Theorem 4.1. Take $q = \lfloor C n^{1/((2+\alpha)\beta+d)} \rfloor$, $w = C' q^{2\beta}/n$
and $m = q^d$ for some constants $C > 0$, $C' > 0$, and choose $A_0$ as a Euclidean
ball contained in $\mathcal{X}_0$. As in the proof of Theorem 3.5, under the appropriate
choice of $C$ and $C'$, the regression function $\eta_\sigma$ belongs to $\Sigma(\beta, L, \mathbb{R}^d)$ and
the margin Assumption (MA) is satisfied. Moreover, it is easy to see that the
marginal distribution of $X$ obeys the mild density assumption [the $(c_0, r_0)$-regularity of the support of $P_X$ follows from considerations analogous to
those in the proof of Theorem 3.5]. Thus, $\mathcal{H} \subset \mathcal{P}'_\Sigma$. Choosing $C'$ small enough
and using (6.8), we obtain Theorem 4.1.
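
To see how the choice of $(q, w, m)$ in the proof of Theorem 4.1 produces the rate, the sketch below evaluates the right-hand side of (6.8) with $b = b' = C_\phi q^{-\beta}$. The constants `C`, `C_prime`, `C_phi` and the small rounding guard are illustrative assumptions; the values of $n$ are chosen so that $n^{1/((2+\alpha)\beta+d)}$ is an integer.

```python
import math

def theorem_41_construction(n, alpha, beta, d, C=1.0, C_prime=0.25, C_phi=1.0):
    """Return (q, w, m, bound) where bound = m w b' (1 - b sqrt(n w)) / 2 as in (6.8)."""
    # 1e-9 guards against floating-point undershoot when the root is an exact integer
    q = max(1, math.floor(C * n ** (1.0 / ((2 + alpha) * beta + d)) + 1e-9))
    w = C_prime * q ** (2 * beta) / n
    m = q ** d
    b = C_phi * q ** (-beta)
    b_prime = C_phi * q ** (-beta)
    bound = m * w * b_prime * (1 - b * math.sqrt(n * w)) / 2
    return q, w, m, bound

alpha, beta, d = 1, 1, 1                       # rate exponent (1+alpha)beta/((2+alpha)beta+d) = 1/2
for n in (16, 10 ** 4, 10 ** 8):
    q, w, m, bound = theorem_41_construction(n, alpha, beta, d)
    print(n, q, bound * n ** 0.5)              # last column stays constant in n
```

Here $b\sqrt{nw} = C_\phi\sqrt{C'}$ does not depend on $n$, which is why taking $C'$ small enough keeps the factor $1 - b\sqrt{nw}$ bounded away from zero.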

6.3. Proof of Theorem 4.2. We prove the theorem for $p < \infty$. The proof
for $p = \infty$ is analogous. For any decision rule $f$ we set $d(f) = R(f) - R(f^*)$
and

$$f^{**}(x, f) = \begin{cases} f^*(x), & \text{if } \eta(x) \ne 1/2, \\ f(x), & \text{if } \eta(x) = 1/2, \end{cases} \qquad \forall x \in \mathbb{R}^d.$$

Lemma 6.1. Under Assumption (MA), for any decision rule $f$ we have

$$P_X(f(X) \ne f^{**}(X, f)) \le C d(f)^{\alpha/(1+\alpha)}. \tag{6.9}$$

Proof. Note that $f^{**}(\cdot, f)$ is a Bayes rule, and following the same lines
as in Proposition 1 of [20], we get $P_X(f(X) \ne f^{**}(X, f),\ \eta(X) \ne 1/2) \le C d(f)^{\alpha/(1+\alpha)}$. It remains to observe that $P_X(f(X) \ne f^{**}(X, f),\ \eta(X) \ne 1/2) = P_X(f(X) \ne f^{**}(X, f))$. $\square$

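For a Borel function $\bar\eta$ on $\mathbb{R}^d$ define $f_{\bar\eta} = \mathbb{1}_{\{\bar\eta \ge 1/2\}}$, $f^{**}_{\bar\eta} = f^{**}(\cdot, f_{\bar\eta})$ and

$$Z_n(f_{\bar\eta}) = [R_n(f_{\bar\eta}) - R_n(f^{**}_{\bar\eta})] - [R(f_{\bar\eta}) - R(f^{**}_{\bar\eta})] = [R_n(f_{\bar\eta}) - R_n(f^{**}_{\bar\eta})] - d(f_{\bar\eta}).$$

Let $\eta_n$ be an element of $\mathcal{N}_{\varepsilon_n}$ such that $\|\eta_n - \eta\|_p \le \varepsilon_n$, where $\|\cdot\|_p$ is
the $L_p(\mathcal{C}, P_X)$ norm. It follows from the comparison inequality (5.3) that
$d(f_{\eta_n}) \le C \varepsilon_n^{(1+\alpha)p/(p+\alpha)} \triangleq \delta_n$. Set

$$\Delta_n = C n^{-(1+\alpha)p/((2+\alpha)p+\rho(p+\alpha))}$$

(i.e., $\Delta_n$ is of the order of the desired rate). Fix $t > 0$ and introduce the set
$\mathcal{N}^*_t = \{\bar\eta \in \mathcal{N}_{\varepsilon_n} : d(f_{\bar\eta}) \ge t\Delta_n\}$.

For any $t > 0$ we have

\begin{align*}
P\big( d(\hat f_n^S) \ge t\Delta_n \big) &\le P\Big( \min_{\bar\eta \in \mathcal{N}^*_t} [R_n(f_{\bar\eta}) - R_n(f_{\eta_n})] \le 0 \Big) \\
&= P\Big( \min_{\bar\eta \in \mathcal{N}^*_t} [Z_n(f_{\bar\eta}) - Z_n(f_{\eta_n}) + d(f_{\bar\eta}) - d(f_{\eta_n})] \le 0 \Big) \\
&\le P\Big( \min_{\bar\eta \in \mathcal{N}^*_t} [Z_n(f_{\bar\eta}) - Z_n(f_{\eta_n}) + d(f_{\bar\eta})/2 + t\Delta_n/2 - d(f_{\eta_n})] \le 0 \Big) \\
&\le P\Big( \min_{\bar\eta \in \mathcal{N}^*_t} [Z_n(f_{\bar\eta}) + d(f_{\bar\eta})/2] \le 0 \Big) + P\big( Z_n(f_{\eta_n}) \ge t\Delta_n/2 - d(f_{\eta_n}) \big) \\
&\le P\Big( \min_{\bar\eta \in \mathcal{N}^*_t} [Z_n(f_{\bar\eta}) + d(f_{\bar\eta})/2] \le 0 \Big) + P\big( Z_n(f_{\eta_n}) \ge t\Delta_n/2 - \delta_n \big).
\end{align*}

Since $\Delta_n$ is of the same order as $\delta_n$, we can choose $t$ large enough to have
$t\Delta_n/2 - \delta_n \ge t\Delta_n/4$. Thus,

\begin{align*}
P\big( d(\hat f_n^S) \ge t\Delta_n \big) &\le \operatorname{card} \mathcal{N}^*_t \max_{\bar\eta \in \mathcal{N}^*_t} P\big( Z_n(f_{\bar\eta}) \le -d(f_{\bar\eta})/2 \big) + P\big( Z_n(f_{\eta_n}) \ge t\Delta_n/4 \big) \\
&\le \exp(A' \varepsilon_n^{-\rho}) \max_{\bar\eta \in \mathcal{N}^*_t} P\big( Z_n(f_{\bar\eta}) \le -d(f_{\bar\eta})/2 \big) + P\big( Z_n(f_{\eta_n}) \ge t\Delta_n/4 \big).
\end{align*}

Note that for any Borel function $\bar\eta$ the value $Z_n(f_{\bar\eta})$ is an average of $n$
i.i.d. bounded and centered random variables whose variance does not exceed $P_X(f_{\bar\eta}(X) \ne f^{**}_{\bar\eta}(X)) \le C d(f_{\bar\eta})^{\alpha/(1+\alpha)}$ [cf. (6.9)]. Thus, using Bernstein's
inequality we obtain

$$P\big( -Z_n(f_{\bar\eta}) \ge a \big) \le \exp\Big( -\frac{C n a^2}{a + d(f_{\bar\eta})^{\alpha/(1+\alpha)}} \Big) \qquad \forall a > 0.$$

Therefore, for $\bar\eta \in \mathcal{N}^*_t$,

$$P\big( Z_n(f_{\bar\eta}) \le -d(f_{\bar\eta})/2 \big) \le \exp\big( -C n\, d(f_{\bar\eta})^{(2+\alpha)/(1+\alpha)} \big) \le \exp\big( -C n (t\Delta_n)^{(2+\alpha)/(1+\alpha)} \big).$$

Similarly, for $t > C$,

$$P\big( Z_n(f_{\eta_n}) \ge t\Delta_n/4 \big) \le \exp\Big( -\frac{C n \Delta_n^2}{\Delta_n + d(f_{\eta_n})^{\alpha/(1+\alpha)}} \Big) \le \exp\Big( -\frac{C n \Delta_n^2}{\Delta_n + \delta_n^{\alpha/(1+\alpha)}} \Big).$$

The result of the theorem follows now from the above inequalities and the
relation $n \Delta_n^{(2+\alpha)/(1+\alpha)} \asymp \varepsilon_n^{-\rho}$.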
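
The two order relations used in this proof, $\delta_n \asymp \Delta_n$ and $n\Delta_n^{(2+\alpha)/(1+\alpha)} \asymp \varepsilon_n^{-\rho}$, are identities between exponents of $n$. The sketch below verifies them with exact fractions for the case $1 \le p < \infty$; the function name and the parameter values are illustrative assumptions.

```python
from fractions import Fraction

def proof_exponents(alpha, rho, p):
    """Return (e, r) with eps_n = n^{-e} and Delta_n of order n^{-r}, for 1 <= p < infinity."""
    alpha, rho, p = map(Fraction, (alpha, rho, p))
    denom = (2 + alpha) * p + rho * (p + alpha)
    return (p + alpha) / denom, (1 + alpha) * p / denom

alpha, rho, p = Fraction(1), Fraction(3, 2), Fraction(2)   # illustrative values
e, r = proof_exponents(alpha, rho, p)

# delta_n = C * eps_n^{(1+alpha)p/(p+alpha)} has the same exponent as Delta_n
print(e * (1 + alpha) * p / (p + alpha) == r)               # True
# n * Delta_n^{(2+alpha)/(1+alpha)} has exponent 1 - r(2+alpha)/(1+alpha) = rho * e
print(1 - r * (2 + alpha) / (1 + alpha) == rho * e)         # True
```

The second identity is what balances the cardinality factor $\exp(A'\varepsilon_n^{-\rho})$ against the Bernstein exponent $C n (t\Delta_n)^{(2+\alpha)/(1+\alpha)}$ for large $t$.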